
Object detection (OD) [1,2,3,4,5,6,7,8] is an essential task that forms the basis of many other computer vision tasks, such as object tracking [9,10], instance segmentation [11,12,13], action recognition [14,15,16,17,18], environment surveillance [19,20,21], video review in sports [22,23] and scene understanding [24,25,26,27,28]. Thanks to the powerful feature-learning ability of deep convolutional neural networks (CNNs) [29,30,31,32,33], object detection research has grown rapidly over the last decade. Deep learning-based object detection techniques can be divided into two categories: one-stage models and two-stage models. Successful two-stage models include R-CNN [2], Fast R-CNN [3], Faster R-CNN [4], SPP-Net [34] and the feature pyramid network (FPN) [35]. These models generate regions of interest (ROIs) in the first stage; they then refine the ROIs to classify the objects and localize the bounding boxes in the second stage. YOLO [6], SSD [36] and several anchor-free models, including the feature selective anchor-free module (FSAF) [37], CornerNet [38], FCOS [39] and CenterNet [40], are one-stage models that directly classify and localize objects from the feature map without a separate ROI stage.
Small object detection (SOD) [41] is an emerging research area within object detection. SOD is widely used in medical image analysis, maritime rescue, face recognition in surveillance video, drone scene analysis and other applications, and many promising deep learning-based SOD works have been published in recent years. Small objects can be defined in two major ways. The first is by relative size [42]: an object is small if the ratios of its bounding box's width and height to the image's width and height are less than 0.1, or if the ratio of its bounding box's area to the image's area is less than 0.03. The second is by absolute size: the COCO [43] dataset considers an object small if its size is less than 32 × 32 pixels. Examples are shown in Figure 1. Under these definitions, the visual features available for a small object are limited.
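For concreteness, the two definitions can be written as simple predicates (a minimal sketch; the relative-size check below assumes that both the width and height ratios must fall below 0.1, which is one reading of [42]):

```python
def is_small_relative(box_w, box_h, img_w, img_h):
    # Relative-size definition [42]: width and height ratios < 0.1,
    # or area ratio < 0.03.
    ratio_small = box_w / img_w < 0.1 and box_h / img_h < 0.1
    area_small = (box_w * box_h) / (img_w * img_h) < 0.03
    return ratio_small or area_small

def is_small_absolute(box_w, box_h):
    # COCO absolute-size definition: area < 32 x 32 pixels.
    return box_w * box_h < 32 * 32
```

A 30 × 30 box is small under the COCO definition, while a 200 × 200 box in a 1000 × 1000 image is small under neither reading.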
Though large-scale datasets, such as Microsoft Common Objects in Context (MS COCO) [43], ImageNet [44] and PASCAL VOC [45], have contributed to the growth of object detection methods, these methods fail to accurately detect small objects. Taking Co-DETR [46], one of the state-of-the-art methods, as an example, its mean average precision (mAP) for small objects on COCO was only 48.4%, which significantly lags behind that for medium and large objects (67.1% and 77.3%, respectively). The main reason for the poor SOD performance is that small objects have lower resolution and occupy fewer pixels than larger objects; the loss of spatial position information caused by down-sampling and pooling operations in convolutional networks makes it more challenging for the detection head to locate small objects. The scarcity of small object datasets is another obstacle to the advancement of SOD. Existing small object datasets mainly concentrate on specific scenarios; see [47] for human faces, [48,49,50,51] for pedestrians and [52,53,54,55,56] for traffic scenes; networks trained on them are unsuitable for general SOD. To overcome these challenges, researchers have developed a series of strategies to improve SOD performance. We summarize these techniques from the perspectives of boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. An in-depth comparative performance evaluation of these methods on well-known datasets is also used to draw meaningful conclusions.
As shown in Table 1, recent object detection reviews present progress in deep learning-based object detection. Zhao et al. [57] reviewed deep learning-based object detection frameworks and mentioned the difficulties of SOD. A large-scale, freely accessible benchmark (DIOR) for object detection in optical remote sensing images was proposed by Li et al. [58]. A thorough analysis of the imbalance issues in object detection was provided by Oksuz et al. [59]. To improve the effectiveness of deep learning-based object detection approaches, researchers working at the intersection of deep learning and computer vision are developing multidisciplinary solutions. Current approaches to class-incremental object detection have been examined by Menezes et al. [60]. However, these reviews mainly focus on the detection of regular-sized rather than small objects.
| Title | Publication | Strengths | Limitations |
| --- | --- | --- | --- |
| Object detection with deep learning: A review [57] | TNNLS 2019 | It reviews the deep learning-based object detection models and the difficulties of SOD. | These reviews offer a thorough summary of object detection. However, they concentrate on regular-sized object detection rather than small objects. |
| Object detection in optical remote sensing images: A survey and a new benchmark [58] | ISPRS 2020 | It constructs DIOR, a large dataset of remote sensing. | |
| Imbalance problems in object detection: A review [59] | arXiv 2020 | It reviews the imbalance problem of object detection. | |
| Continual object detection: A review of definitions, strategies, and challenges [60] | arXiv 2022 | This survey investigates continual object detection. | |
| New generation deep learning for video object detection: A survey [61] | TNNLS 2022 | It systematizes the latest video object detection models and analyzes the performance of these models on two datasets. | |
| A survey of deep learning-based object detection [62] | IEEE Access 2022 | It reviews detection methods, general datasets and typical applications. | |
| A survey of the four pillars for small object detection: Multi-scale representation, contextual information, super-resolution, and region proposal [64] | TSMC 2020 | It discusses the four pillars of SOD and reports on the performance of SOD on three datasets. | These studies do not contain a complete assessment of the most recent SOD approaches. |
| Recent advances in small object detection based on deep learning: A review [65] | IVC 2020 | It reviews the SOD from five perspectives and analyzes the evaluation results for two general datasets. | |
| A survey and performance evaluation of deep learning methods for small object detection [66] | ESWA 2021 | The solutions are summarized for the four challenges of SOD and some experiment analyses are provided. | It only analyzes the performance of three classical object detection algorithms (Faster R-CNN, SSD, YOLO). |
| Deep learning-based detection from the perspective of small or tiny objects: A survey [67] | IVC 2022 | It discusses small- or tiny-object datasets, detection techniques and the performance of these techniques. | These surveys systematically reviewed the development of SOD. Nevertheless, they all lack a comprehensive review of techniques deliberately designed for critical SOD tasks. |
| A guide to image and video based small object detection using deep learning: Case study of maritime surveillance [68] | arXiv 2022 | It reviews the SOD methods and investigates the performance of SOD in maritime environments. | |
| Towards large-scale small object detection: Survey and benchmarks [69] | arXiv 2022 | It presents a detailed study of SOD and yields two large-scale benchmarks for a driving scenario and aerial scene. | |
| Deep learning based small object detection: A survey | Ours | We comprehensively discuss the definition of small objects, the challenges encountered in detecting small objects, the strengths and weaknesses of generic SOD algorithms and three crucial SOD tasks. We also analyze the performance of SOD on three datasets and summarize meaningful conclusions. | |
There are also recent surveys on SOD. In the review by Chen et al. [63], the four pillars of SOD are discussed. However, they did not connect the basic module design of the detector to the challenges in SOD; rather, they only reviewed studies on SOD from the viewpoint of the model framework (Figure 2), such as MMDetection [64], which divides the framework of the detector into a backbone, neck and head. The current SOD methods based on deep learning have been reviewed by Tong et al. [65] from five perspectives; they analyzed the evaluation results for two general datasets. Tong et al. limited their work to generic SOD and did not consider models developed for specific SOD tasks. In addition to summarizing and contrasting current deep learning approaches for SOD, Liu et al. [66] also provided a brief overview of related methods, such as traditional object detection, face detection, image segmentation and remote sensing. However, they only evaluated the performance of a few networks: Faster R-CNN, SSD and YOLO. Partial performance evaluation cannot illustrate the broad picture of SOD. Tong and Wu refined and differentiated between small and tiny objects [67]. To examine this expanding field, Rekavandi et al. [68] presented a thorough analysis of more than 160 research publications released between 2017 and 2022. Other significant works include that by Cheng et al. [69], who constructed two large-scale SOD datasets, SODA-D and SODA-A, focusing on the driving and aerial scenarios, respectively. In contrast to these earlier object detection surveys, we focus on the difficulties related to SOD, investigate recent deep learning-based SOD algorithms and thus present a taxonomy to illustrate the novel strategies developed to improve SOD performance. In addition to providing an in-depth description of deep learning-based SOD algorithms developed in three areas, our study also offers meaningful comparisons of the associated experimental results.
In summary, our contributions are as follows:
1) Systematic overview of deep learning-based SOD algorithms. We analyze state-of-the-art deep learning-based SOD algorithms in accordance with the challenges in SOD, and we provide a taxonomy that summarizes the strategies for improving SOD performance from the perspectives of boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. Additionally, we provide a thorough review of the methods of crucial SOD tasks, including small face detection, small pedestrian detection and aerial image detection.
2) Performance evaluation of SOTA deep learning-based SOD algorithms. We not only analyze the performance of generic SOD methods on general large-scale datasets, but also evaluate the performance of state-of-the-art SOD methods on three crucial SOD tasks: small face detection, small pedestrian detection and aerial image detection.
3) Finally, according to the taxonomy methods and performance analysis of SOD, we discuss potential directions for future research, including suitable metrics for SOD optimization, weakly supervised SOD methods, multi-task joint optimization, and open world or few-shot SOD.
The remainder of the paper is organized as follows. The generic SOD algorithms are discussed in Section 2. Section 3 summarizes the methods proposed for three SOD tasks. Section 4 provides datasets and evaluation metrics and evaluates generic SOD methods as well as small face, small pedestrian and aerial image SOD methods. Future directions are discussed in Section 5. Finally, the conclusion of this paper is presented in Section 6.
In this section, we extensively review generic SOD methods. To deal with the challenges of SOD, existing SOD methods typically add carefully designed modules to pipelines that excel at generic object detection. We describe these methods from four perspectives: boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. The advantages and disadvantages of each perspective, as shown in Tables 2–6, are then discussed in detail.
| Method | Publication | Techniques | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| SSD [36] | ECCV16 | Pyramidal feature hierarchy without fusing features. | SSD can detect objects of various sizes. | The low-level prediction feature map has no strong semantics. |
| FPN [35] | CVPR17 | Feature pyramid network (including feature fusion and multi-scale fusion modules, etc.). | FPN dramatically improves the detection accuracy of small objects. | The feature representation capability will be diminished by the semantic gap between feature layers of various scales. |
| RetinaNet [74] | ICCV17 | | RetinaNet alleviates the foreground-background class imbalance problem. | |
| FSSD [75] | arXiv17 | Lightweight feature fusion module. | | |
| MDSSD [76] | arXiv18 | | MDSSD incorporates contextual information that is more conducive to SOD. | Lower detection speed than SSD. |
| [77] | CVPR21 | FPN with a learnable fusion factor. | The fusion factor further improves FPN performance for small objects. | |
| HRDNet [78] | arXiv20 | FPN with image pyramid. | HRDNet acquires more details for small objects with high resolution. | Large number of parameters. |
| IPG-Net [79] | CVPR20 | | IPG-Net alleviates the vanishing of small object features. | This method is inefficient. |
| RHF-Net [80] | CVPR20 | Recursive hybrid fusion pyramid network. | Low computational cost and high accuracy. | |
| QueryDet [81] | CVPR22 | Query mechanism. | Accelerating inference with sparse query. | |
| EFPN [82] | TMM22 | Super-resolution (including a super-resolution layer and enhancing representations of small objects to be similar to large ones). | EFPN adds a high-resolution layer to the FPN to increase the accuracy of SOD. | Super-resolution feature extraction leads to more computational costs. |
| [83] | CVPR17 | | The GAN-based approaches effectively enhance the level of detail of information on small objects. | |
| MTGAN [84] | ECCV18 | | | |
| TPS [85] | ICCV19 | | | |
| MRAE [86] | CVPR22 | FPN with attention weight. | It provides a practical solution for multi-resolution feature extraction without using a GAN, and it is time-efficient. | |
| Method | Publication | Techniques | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| SNIP [95], SNIPER [96] | CVPR18 | Scale normalization training strategy for image pyramids. | SNIP and SNIPER can effectively improve the detection performance of small objects. | They require an input image pyramid, which brings a high computational cost. |
| SAN [97] | CVPR18 | Scale-aware training. | SAN makes the network more robust against scale variation. | |
| Trident [98] | ICCV19 | Multi-branch architecture and scale-aware training. | The multi-branch technique aligns the receptive field size with the object size. | It may cause over-fitting in each branch due to too few effective samples. |
| POD [99] | ICCV19 | Global scale learning. | This method makes the network more sensitive to scale variation. | |
| Method | Publication | Techniques | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| ION [101] | CVPR16 | Integrates contextual information. | ION exploits context and multi-scale representations to improve SOD. | Underutilization of early feature layers. |
| DSSD [102], CSSD [103] | arXiv17, WACV18 | Fusing contextual information in different ways to improve SOD performance. | | Slower detection speed than SSD. |
| SMN [104] | ICCV17 | Spatial memory for contextual reasoning. | SMN models the instance-level context to improve the performance of SOD. | The gradient vanishes as the reasoning signal and perceptual signal cancel each other out. |
| IRR [105] | arXiv20 | Contextual reasoning that integrates intrinsic relations. | IRR updates the initial regional features to boost SOD. | Semantic features are difficult to extract for small objects. |
| FA-SSD [106] | ICAIIC21 | Context with attention. | FA-SSD is more accurate than SSD. | It is less accurate than DSSD. |
| Method | Publication | Techniques | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Kisantal et al. [110] | arXiv19 | Oversampling and random copy-pasting. | This approach achieves better detection accuracy for small objects. | Random copying and pasting may cause background mismatch. |
| RRNet [111] | ICCV19 | Adaptive resampling augmentation strategy. | The anchor-free design and adaptive resampling result in excellent performance for very small objects. | |
| Ünel et al. [112] | CVPR19 | Tiling-based augmentation. | The method provides a good trade-off between accuracy and time cost. | |
| DST [113] | arXiv20 | Uses feedback information to guide data preparation. | The feedback-driven, dynamic data preparation paradigm mitigates the scale-invariance issue. | |
| Zoph et al. [114] | ECCV20 | Automatic data augmentation. | This approach has no additional inference cost and minimal training cost. | The strategy is intricate. |
| Chen et al. [116] | CVPR21 | | It can be transferred to other datasets and tasks and is scale-sensitive. | The high time cost of searching in auto-augmentation approaches. |
| Method | Publication | Techniques | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| PPDet [117] | BMVC20 | Anchor-free with a new labeling strategy. | It reduces the contributions of non-discriminatory features during training. | |
| CenterNet++ [118] | CVPR21 | An anchor-free detector that uses triplet key points to represent objects. | This model with multi-resolution performs better. | |
| NWD [119] | arXiv21 | A new metric to replace IoU. | These two metrics are more effective than the IoU metric for small object detection. | |
| RFLA [120] | ECCV22 | | | |
| C3Det [121] | CVPR22 | Annotation framework for tiny objects. | It alleviates the expense of tiny-object annotation. | |
| SAHI [122] | arXiv22 | Slicing-aided inference. | This scheme is plug-and-play, does not require pre-training and improves the accuracy of detecting small objects. | Larger feature maps require more memory and computing cost. |
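To illustrate the NWD entry in the table above: the metric replaces IoU with a similarity derived from a Wasserstein distance between 2-D Gaussians fitted to the two boxes. The sketch below follows common descriptions of the method; the normalizing constant `c` is an illustrative, dataset-dependent value, and both details should be checked against [119]:

```python
import math

def nwd(box1, box2, c=12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian N((cx, cy), diag(w/2, h/2));
    the squared 2-Wasserstein distance between two such Gaussians is the
    squared Euclidean distance between their (cx, cy, w/2, h/2) vectors.
    c is a dataset-dependent normalizing constant (illustrative here).
    """
    (cx1, cy1, w1, h1), (cx2, cy2, w2, h2) = box1, box2
    w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
             + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)
```

Unlike IoU, this similarity degrades smoothly with center distance even for non-overlapping boxes, which is why it suits tiny objects whose IoU collapses to zero under small localization shifts.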
The difficulty in precisely locating small objects is mainly due to the down-sampling operations of CNNs, which cause the features of small objects to fade; the low spatial resolution of high-level feature maps loses much of the spatial position information of small objects. A natural solution is to use high-resolution feature maps or high-resolution images. However, employing high-quality images or increasing the feature map resolution results in higher computing costs. Numerous scholars have constructed feature pyramids by reusing the multi-scale feature maps produced by network forward propagation, and then used the low-level, high-resolution feature maps with finer spatial details to detect small objects. Additionally, some models learn a mapping function from low-resolution features to high-resolution features to achieve the same detection effect as for large objects. Both approaches substantially increase the resolution of the predictive feature layer. Several typical models that boost the resolution of input features are shown in Figure 3.
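The layer-by-layer reuse of multi-scale feature maps can be sketched with toy 1-D "feature maps" (an illustrative abstraction: a real feature pyramid applies 1 × 1 lateral convolutions before the element-wise addition and a 3 × 3 convolution after it; here plain addition and nearest-neighbor up-sampling stand in for both):

```python
def upsample2x(feat):
    # Nearest-neighbor 2x up-sampling of a 1-D "feature map".
    return [v for v in feat for _ in range(2)]

def fpn_merge(features):
    """Top-down merge over toy 1-D feature maps, ordered from shallow
    (high resolution) to deep (low resolution): each level is the sum of
    its own features and the up-sampled merged features of the level above."""
    merged = [None] * len(features)
    merged[-1] = features[-1]  # deepest level passes through unchanged
    for i in range(len(features) - 2, -1, -1):
        up = upsample2x(merged[i + 1])
        merged[i] = [a + b for a, b in zip(features[i], up)]
    return merged
```

The shallow, high-resolution output thus carries semantics accumulated from every deeper level, which is the property that benefits small objects.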
SSD [36] is a multi-scale object detection technique that detects objects by placing reference windows of different scales in different layers of the network. However, its detection accuracy for small objects is not greatly improved; the primary reason is that low-level feature maps have a limited receptive field and a significantly poorer ability to represent features than deep feature maps. Therefore, Lin et al. proposed the FPN [35]. The core idea behind the FPN is to use the network's forward propagation to create four feature maps of different scales, merge the high-level feature maps with the lower-level feature maps through layer-by-layer up-sampling and fuse features from different network depths to achieve feature enhancement; predictions are then made on the fused feature maps, with each layer responsible for only one scale of objects. Experiments show that the FPN significantly increases SOD accuracy while maintaining a detection speed of 6 FPS. Since the FPN was proposed, numerous enhanced variants have been developed, including PANet [70], BiFPN [71], ASFF [72], NAS-FPN [73], etc. Object-proposal-based (two-stage) detectors long had modestly better detection accuracy, despite one-stage fully convolutional detectors having a significantly faster detection speed. After investigating the reasons behind this, Lin et al. presented RetinaNet [74], the first one-stage network to outperform two-stage networks. Lin et al. argued that the foreground-background class imbalance mostly accounts for the one-stage detectors' inferior detection performance, so focal loss was proposed as an improvement over the cross-entropy loss. Focal loss is given by Eq (1):
$ FL\left(p\right) = \begin{cases}-\alpha {\left(1-p\right)}^{\gamma }\mathrm{log}\left(p\right), & \text{if}\ y = 1\\ -\left(1-\alpha \right){p}^{\gamma }\mathrm{log}\left(1-p\right), & \text{otherwise}\end{cases} $ | (1)
$ \alpha $ is the balancing factor; $ p\in \left[0, 1\right] $ is the model's estimated probability for the positive class (y = 1). The rate at which easy examples are down-weighted is adjusted by the focusing parameter $ \gamma \left(\gamma \ge 0\right) $. By reducing the learning weight of easy background samples during training, RetinaNet "focuses" on hard samples and redistributes the network's learning capacity. The lightweight feature fusion module proposed for FSSD [75] uses down-sampling to create a new feature pyramid. MDSSD [76] applies deconvolution to high-level feature maps with strong semantic information and then fuses them with low-level feature maps by using a fusion module, preserving rich spatial details and strong feature representation capabilities for small objects. The architectures of RetinaNet and MDSSD are shown in Figure 3.
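Eq (1) translates directly into a per-example function (a minimal sketch; the defaults α = 0.25 and γ = 2 are values commonly used with RetinaNet rather than values fixed by the text):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-example focal loss, following Eq (1).

    p: predicted probability of the positive class, in (0, 1);
    y: ground-truth label (1 for a positive sample, 0 otherwise)."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

With γ = 0 this reduces to the α-weighted cross-entropy; for γ > 0, a well-classified positive (p close to 1) contributes almost nothing, while a hard positive (p close to 0) keeps nearly its full cross-entropy weight.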
At the last layer of a backbone, small object features have almost disappeared, and the top-down path makes it nearly impossible for FPNs to fuse the features of small objects. Additionally, as the network gets deeper, the deep feature map gains more semantic information but loses spatial information. This causes an offset between anchors and convolutional features: after several convolutions, the position of an anchor on the deep feature map differs from its position on the original image. Moreover, deep features and shallow features cannot be effectively aligned by FPN fusion. Gong et al. [77] proposed a fusion factor for describing the coupling degree of adjacent layers in FPNs, which can be computed from dataset statistics or learned implicitly. Adjusting the fusion factor of adjacent layers in an FPN can adaptively drive the shallow layers to focus on learning tiny objects, thus improving their detection. The high-resolution detection network (HRDNet) [78] accepts multiple resolution inputs via multi-depth backbones. To cut down on computational costs, the multi-depth image pyramid network (MD-IPN) uses a multi-depth backbone to output multi-scale, multi-level feature maps: high-resolution input is fed into a shallow network to preserve more positional information, while low-resolution input is fed into a deep network to extract more semantics. Multi-scale FPNs align and fuse the multi-scale feature groups produced by the MD-IPN to decrease the information mismatch between these multi-scale, multi-level features. Liu et al. [79] proposed IPG-Net to mitigate the disappearance of small object features after serial down-sampling and the dislocation between spatial and semantic information; it includes an IPG transformation module and an IPG fusion module.
IPG-Net receives an image pyramid as the input; the IPG transformation module extracts shallow features from image pyramids of various resolutions that include rich spatial information and detailed information; the IPG fusion module fuses the shallow features extracted by the IPG transformation module and the deep features of the backbone. RHF-Net [80] applies top-down and bottom-up feature fusion. It contains a recursive execution of the hybrid fusion module that enables RHF-Net to both connect high-level semantic features to the low-level features (top-down direction) and reshape the rich spatial features of low-level feature maps to the deeper layer (bottom-up direction), thus improving the contextual features of objects of all scales.
The spatial distribution of small objects on the high-resolution feature maps of the feature pyramid is very sparse, occupying only a small fraction of each map. QueryDet [81] uses a query mechanism to accelerate the inference speed of feature pyramid-based detectors by preventing the detection head from performing resource-intensive computation on the entire high-resolution feature map. It includes a query head, in parallel with classification and regression, to predict the locations (query keys) of possible small objects in the features of the previous layer. The current layer uses these locations to generate a sparse value feature map (query values) and then predicts the query keys of this layer to pass to the following layer.
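As a toy illustration of this query mechanism (a 1-D sketch with invented names; the real QueryDet operates on 2-D feature maps and computes the sparse query values with sparse convolution):

```python
def sparse_query(coarse_scores, threshold=0.5):
    """Toy 1-D sketch of cascaded sparse querying.

    Cells on a coarse (low-resolution) map whose small-object score
    exceeds the threshold become query keys; only the corresponding
    cells on the 2x-finer map (the query values) need to be computed
    and passed to the detection head."""
    keys = [i for i, s in enumerate(coarse_scores) if s > threshold]
    # Each coarse cell covers two cells on the 2x-finer level.
    fine_cells = sorted({j for i in keys for j in (2 * i, 2 * i + 1)})
    return keys, fine_cells
```

Because small objects are sparse, the set of fine cells is typically far smaller than the full high-resolution map, which is where the speed-up comes from.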
Super-resolution is another effective method that directly enriches the information of small objects by increasing the resolution of the input image. The EFPN [82] adds a super-resolution layer to the FPN and uses a feature texture transfer module to super-resolve features by extracting regional texture features from the reference features; this adds convincing details and improves SOD accuracy. To eliminate the representational disparity between large and small objects and allow small objects to attain the same detection accuracy as large ones, Li et al. [83] used a GAN to enhance a small object's feature representation to a super-resolved representation. However, the super-resolved feature might not be convincing, as the large object image and the small object image are not from the same image. SOD-MTGAN [84] learns the mapping between low-resolution image patches and high-resolution image patches, which reduces the computational cost. Noh et al. [85] used high-resolution features for direct supervision; under the guidance of a super-resolution discriminator, low-resolution features are fed to a super-resolution feature generator to generate high-resolution features. MRAE [86] uses a network to obtain attention weights for each level of the feature maps, generates the final attention feature maps and then performs feature fusion to further enhance the information useful for small targets. The EESRGAN [87] adds edge-enhanced sub-networks (EENs) [88] to the ESRGAN [89]. EENs perform edge enhancement on the intermediate super-resolution (ISR) images generated by the generator to produce the final super-resolution image. The detector works together with the discriminator in the adversarial training, and the discriminator trains the generator by using relativistic loss [90].
The following Eqs (2) and (3) show the adversarial loss [91] of the generator and the relativistic loss of the discriminator, respectively, where $ {D}_{Ra} $ is the probability that a real image ($ {I}_{HR} $) is relatively more realistic than a generated intermediate image ($ {I}_{ISR} $), $ {E}_{{I}_{SR}} $ is the operation that averages over all generated intermediate images in a mini-batch and $ {E}_{{I}_{HR}} $ is the operation that averages over all real images in a mini-batch. Additionally, the EESRGAN employs end-to-end training to backpropagate the detector loss to the generator; thus, the generator receives gradients from both the detector and the discriminator to enhance the quality of the super-resolution images. Cao et al. proposed the MHN [92], which splits the network into three distinct branches (branch-l, branch-m, branch-s), where each branch produces equivalent high-level semantic feature maps at a variety of resolutions, allowing it to better match objects of various scales.
$ {L}_{G}^{Ra} = -{E}_{{I}_{HR}}\left[log\left(1-{D}_{Ra}\left({I}_{HR}, {I}_{ISR}\right)\right)\right]-{E}_{{I}_{SR}}\left[log\left({D}_{Ra}\left({I}_{ISR}, {I}_{HR}\right)\right)\right] $ | (2) |
$ {L}_{D}^{Ra} = -{E}_{{I}_{HR}}\left[log\left({D}_{Ra}\left({I}_{HR}, {I}_{ISR}\right)\right)\right]-{E}_{{I}_{SR}}\left[log\left(1-{D}_{Ra}\left({I}_{ISR}, {I}_{HR}\right)\right)\right] $ | (3) |
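Treating the discriminator's raw (pre-sigmoid) outputs as given, Eqs (2) and (3) can be sketched in plain Python, with batch means standing in for the expectation operators (variable names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_ra(c_a, c_b_mean):
    # D_Ra(a, b): probability that image a is relatively more
    # realistic than the batch-average image b.
    return sigmoid(c_a - c_b_mean)

def discriminator_loss(c_real, c_fake):
    # Eq (3): relativistic average loss for the discriminator over a
    # mini-batch of raw critic outputs for real (I_HR) and generated
    # intermediate (I_ISR) images.
    mean_r = sum(c_real) / len(c_real)
    mean_f = sum(c_fake) / len(c_fake)
    l_real = -sum(math.log(d_ra(cr, mean_f)) for cr in c_real) / len(c_real)
    l_fake = -sum(math.log(1 - d_ra(cf, mean_r)) for cf in c_fake) / len(c_fake)
    return l_real + l_fake

def generator_loss(c_real, c_fake):
    # Eq (2): adversarial loss for the generator; the roles of the
    # real and generated terms are reversed.
    mean_r = sum(c_real) / len(c_real)
    mean_f = sum(c_fake) / len(c_fake)
    l_real = -sum(math.log(1 - d_ra(cr, mean_f)) for cr in c_real) / len(c_real)
    l_fake = -sum(math.log(d_ra(cf, mean_r)) for cf in c_fake) / len(c_fake)
    return l_real + l_fake
```

When the critic cleanly separates real from generated images, the discriminator loss is small and the generator loss is large, and the relationship reverses once the generator's outputs are rated as more realistic than the real images.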
The largest object in the COCO dataset is 20 times larger than the smallest, and CNNs are not robust against such large scale variance. Scale-aware training strategies can make the detector more robust against scale variance. A common process of the scale-aware training model is shown in Figure 4.
Previously proposed approaches use image pyramids [93,94] to improve the accuracy of object detection at various scales, at the cost of larger memory requirements. Scale normalization for image pyramids (SNIP) [95] is a training strategy that uses an image pyramid and only backpropagates the loss for objects whose size falls within a predetermined range. Going further, SNIPER [96] chooses chips with a fixed resolution of 512 × 512 pixels from each layer of the pyramid to act as the training unit, unlike SNIP, which processes every pixel in an image. Because of the smaller chip resolution, it can train with a larger batch size, which improves both training efficiency and detection accuracy. Kim et al. proposed a scale-aware network (SAN) [97] that maps convolutional features from different scales onto a scale-invariant subspace, making CNN-based detection methods more robust against scale variation; they also constructed a unique learning method that considers purely the relationships between channels, without spatial information, for efficient learning of the SAN. This method essentially improves the quality of convolutional features in the scale space and can be generally applied to many CNN-based detection methods to enhance detection accuracy with a slight increase in computing time.
Trident [98] is a multi-branch parallel network in which each branch adopts an appropriate dilation ratio to provide a receptive field size that aligns with the object size. Moreover, a scale-sensitive training approach is applied to enhance each branch's capacity for scale perception and to prevent objects of extreme scale from being trained on branches with unmatched receptive fields. Each branch's effective range, $ [{l}_{i}, {u}_{i}] $, is given by Eq (4):
$ {l}_{i}\le \sqrt{wh}\le {u}_{i} $ | (4) |
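The range filter of Eq (4) can be sketched as follows; the branch ranges below are illustrative values for three branches, not necessarily those used by Trident:

```python
import math

# Hypothetical valid ranges [l_i, u_i], one per branch.
BRANCH_RANGES = [(0, 90), (30, 160), (90, float("inf"))]

def branches_for_box(w, h, ranges=BRANCH_RANGES):
    """Return indices of branches whose valid range contains sqrt(w*h),
    implementing the filter l_i <= sqrt(wh) <= u_i of Eq (4)."""
    s = math.sqrt(w * h)
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= s <= hi]
```

Because the ranges overlap, a mid-sized object may train on more than one branch, while extreme scales are excluded from branches with mismatched receptive fields.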
Peng et al. [99] showed that the local and dense continuous scales, which are hard to optimize, are not necessary, and that a network can be granted scale-awareness through a collaboration of well-learned global scales across layers. Therefore, they designed a global scale learning module that replaces the normal convolutional module and learns the appropriate global scale for each layer.
Visual objects frequently coexist with other relevant objects in a certain setting, which provides rich contextual associations to be exploited. Researchers [100] have shown that utilizing the context as extra information can help to detect small objects with obscure features. Two typical models of incorporating contextual information are shown in Figure 5.
Chen et al. [42] extended the R-CNN model by using ContextNet and a small region proposal generator to improve SOD. Regarding the region proposal network (RPN), Chen et al. used smaller RPN anchor sizes ($ {16}^{2} $, $ {40}^{2} $, $ {100}^{2} $ vs. $ {128}^{2} $, $ {256}^{2} $, $ {512}^{2} $). ContextNet integrates contextual information to calculate the final classification score. Bell et al. [101] proposed ION, which utilizes information inside and outside of the ROI to improve detection performance. Regarding the inside part, ION extracts the features of the ROI at several levels at different scales by using skip pooling to enhance the ability to detect small objects. Regarding the outside part, ION extracts the contextual information outside of the ROI by using a spatial recurrent neural network to enhance the feature information and promote the subsequent classification and regression performance. The DSSD [102] fuses deep semantic information as context with shallow semantic information. The CSSD [103] is a context-aware framework that incorporates context by integrating deconvolutional or dilated convolutional layers into SSD. In object detection, there are two common contexts. Image-level context refers to modeling the contextual information of each pixel in the whole image, which is implicitly incorporated into the deep convolutional network, while instance-level context, which models object-object relationships, is an important clue for object detection and reasoning. A spatial memory network (SMN) [104] was proposed to capture instance-level context. The network detects an object, remembers it and then uses it as prior knowledge to help detect the previously missed target in the next iteration. Fu et al. [105] introduced a unique contextual reasoning method for SOD that models and infers the relationships between objects' inherent semantic and spatial layouts.
The learnable semantic association functions are defined by the semantic module from the standpoint that proposals belonging to the same category share semantic co-occurrence information. The formula is given by Eq (5):
$ {s}_{ij}^{\text{'}} = \sigma \left(i, j\right)\cdot f\left({p}_{i}^{o}, {p}_{j}^{o}\right) = \sigma \left(i, j\right)\cdot \phi \left({p}_{i}^{o}\right)\phi {\left({p}_{j}^{o}\right)}^{T} $ | (5) |
where $ \sigma \left(i, j\right) $ denotes an indicator function and $ \phi $ maps the initial region features $ {p}_{i}^{o} $ to latent representations. The spatial layout module disregards semantic similarity and builds relationships based on spatial similarity and spatial distance in the internal spatial layout, allowing small objects that have a high degree of spatial similarity and appear in clusters to communicate spatial-layout context to one another. FA-SSD [106] is a combination of F-SSD and A-SSD: F-SSD uses higher-level feature maps as context to concatenate with low-level feature maps, while A-SSD uses an attention mechanism to suppress unnecessary shallow features in the background. Both image-level and instance-level context are commonly used in SOD.
High-quality large-scale datasets can greatly improve the performance of deep learning SOD. However, the amount of labeled data is still far from sufficient due to the high cost of annotation. Data augmentation is a common method to enrich the diversity of the dataset, thus improving the generality and robustness of the model to some extent. This can also help to mitigate the degradation of object detection accuracy due to the uneven distribution of different scale objects in the dataset.
Many data augmentation techniques have been developed, such as affine transformation, Mosaic [107], MixUp [108] and CutMix [109], but these methods perform better on medium- and large-sized objects than on small objects. Kisantal et al. [110] thoroughly investigated the MS COCO dataset and discovered a sample imbalance problem: images with small objects make up only a small fraction of the dataset; moreover, each such image contains few small objects, and their locations lack diversity. Kisantal et al. proposed oversampling images with small objects and copy-pasting small objects to increase their number during training. Chen et al. [111] found that random copying and pasting leads to background mismatch and object size mismatch. To solve this, they employed adaptive data augmentation, which uses a semantic segmentation network to obtain an a priori roadmap and samples an effective position to place the object guided by the roadmap. Ünel et al. [112] proposed a tiling-based technique in which the input images are deliberately split into overlapping tiles to increase the relative pixel area of small objects.
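The copy-paste oversampling idea can be sketched as a toy routine; the grayscale image, (x, y, w, h) box format and size threshold are assumptions, and the original method additionally uses segmentation masks and overlap checks:

```python
import numpy as np

def copy_paste_small_objects(image, boxes, max_size=32, copies=1, rng=None):
    """Naive copy-paste augmentation in the spirit of Kisantal et al. [110]:
    each small object (sqrt(area) < max_size) is copied to a random
    location, and a matching box is appended to the annotation list."""
    rng = rng or np.random.default_rng(0)
    img = image.copy()
    out_boxes = list(boxes)
    H, W = img.shape[:2]
    for (x, y, w, h) in boxes:
        if np.sqrt(w * h) >= max_size:
            continue  # only small objects are duplicated
        patch = img[y:y + h, x:x + w].copy()
        for _ in range(copies):
            nx = int(rng.integers(0, W - w))
            ny = int(rng.integers(0, H - h))
            img[ny:ny + h, nx:nx + w] = patch
            out_boxes.append((nx, ny, w, h))
    return img, out_boxes
```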
To address scale variance, DST [113] receives the loss proportion caused by small objects as feedback. If the loss proportion is smaller than the predetermined threshold, the training images are enlarged and spliced in the following iteration to compensate for the missing small objects. Zoph et al. [114] used AutoAugment [115] to find the optimal data augmentation method for object detection by applying an augmentation strategy search to the training set. An RNN controller and a reinforcement learning methodology are included in the search strategy. Chen et al. [116] proposed scale-aware automatic data augmentation, which includes a scale-aware search space with augmentations at the image and box levels, as well as a search metric called the Pareto scale balance. The metric is realized by recording accumulated loss and accuracy over various scales.
Samet et al. [117] proposed a new labeling technique in which the predictions derived from individual features are aggregated into one prediction to reduce the labeling noise of the anchor-free detector. Duan et al. proposed CenterNet++ [118], which uses a triplet of a center key point and a pair of corners to represent an object; the corners can locate objects with any geometry. Wang et al. [119] evaluated the sensitivity of the Intersection over Union (IoU) to position variations of small objects, and they suggested replacing the IoU with a new metric that models each box as a Gaussian distribution and uses the normalized Wasserstein distance (NWD) to measure the similarity between the two distributions. Xu et al. [120] presented a receptive field distance to directly quantify the similarity between the Gaussian receptive field and the ground truth, rather than assigning samples with IoU-based sampling strategies. C3Det [121], an interactive, multi-class, tiny-object annotation framework proposed by Lee et al., reduces the effort and cost of annotation in real-world settings. SAHI [122] divides the input images into overlapping slices to yield a higher percentage of small objects in the image fed to the network.
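Because both boxes are modeled as axis-aligned Gaussians, the NWD reduces to a closed form. A minimal sketch, assuming the box format (cx, cy, w, h) and an illustrative, dataset-dependent normalizing constant `c`:

```python
import math

def nwd(box1, box2, c=12.8):
    """Normalized Wasserstein distance in the spirit of Wang et al. [119].

    Each box (cx, cy, w, h) is modeled as a 2-D Gaussian with mean
    (cx, cy) and covariance diag(w^2/4, h^2/4); the 2nd-order Wasserstein
    distance between two such Gaussians is the Euclidean distance between
    the vectors (cx, cy, w/2, h/2). The exponential maps it to (0, 1]."""
    p1 = (box1[0], box1[1], box1[2] / 2.0, box1[3] / 2.0)
    p2 = (box2[0], box2[1], box2[2] / 2.0, box2[3] / 2.0)
    w2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
    return math.exp(-w2 / c)
```

Unlike IoU, this similarity decays smoothly with center offset, so a tiny box shifted by a few pixels is not abruptly judged a complete mismatch.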
In this section, we present a systematic review of SOD in terms of small face detection, small pedestrian detection and aerial image detection tasks. We first thoroughly describe the current approach to each task. Then, a comprehensive summary of the strengths and weaknesses of each method is presented.
Multi-scale modeling [123] was proposed following a thorough investigation of image resolution, object scale variation and contextual information. The algorithm uses SSD as the foundation and fuses a sparse discrete image pyramid to handle the scale shift of objects. Rich contextual information is necessary for SOD, but the low-level feature maps that SOD relies on lack semantic information, whereas deep feature maps contain rich contextual and semantic information. As a result, multi-layer feature fusion is incorporated into SOD, which enhances the performance of small face detection. S3FD [124] incorporates a scale-equitable face detection network to adapt face detection at various scales. Additionally, the effective receptive field and equal-proportion interval principles are used to define the scales of the anchors, ensuring that anchors of different scales are distributed uniformly across the image and that anchors at different layers match their corresponding effective receptive fields. Then, a scale compensation anchor-matching approach increases the recall rate of small faces. Lastly, the false positive rate of small faces is decreased by predicting the background probability of each small anchor with a max-out label. The method of [125] uses a generative adversarial network to generate high-resolution faces. Face-MagNet [126] employs ConvTranspose (kernel = 8, stride = 4) layers that pass the features of small faces from the lower feature layers to the prediction layer inside an RPN and classifier, magnifying the feature maps for better detection of small faces.
Zhu et al. [127] pointed out that anchor-based face detectors do not handle small faces well because the anchor and small faces cannot overlap perfectly, making it difficult to adjust the anchor to be close to the ground truth. Therefore, Zhu et al. proposed an expected max overlapping (EMO) score, which improves the ability of the anchor and face to obtain a high IoU. By increasing the number of small-scale anchors, the method enhances the likelihood of matching a face. Additionally, to obtain a high IoU between these faces and the anchors, the algorithm randomly shifts the face positions during training. Finally, a compensation strategy for anchor matching was also proposed to improve the chance of detecting hard faces. TinaFace [128] modifies RetinaNet and achieved a 92.4% average precision (AP). First, a DCN [129] is introduced into the backbone to learn complex geometric transformations; then, Inception is used to improve the multi-scale representation. The bounding box regression loss is changed from smooth L1 to DIoU [130] because DIoU is more accommodating of small objects. Finally, an IoU-aware branch is included to address the mismatch between localization accuracy and classification scores. Hard example mining techniques like OHEM [131] identify hard positive and hard negative examples and focus more training effort on those hard instances to improve detector performance. Zhang et al. [132] increased the effectiveness of OHEM by combining it with image-level hard mining to train the face detector, which automatically adjusts the training weights of images according to their difficulty. Additionally, they used a detector that only produces a single high-resolution feature map with small anchors to specifically learn small faces and trained it by using the hard image mining strategy. The strengths and weaknesses of small face detection methods are shown in Table 7.
Method | Publication | Techniques | Strengths | Weaknesses |
Hu and Ramanan [123] | CVPR17 | Super-resolution by GAN. | The joint super-resolution and refinement model is effective. | No fusion of contextual information. |
S3FD [124] | ICCV17 | Scale-invariant strategy. | Three tricks (the scale-equitable framework, max-out background label and scale compensation anchor matching) achieve superior performance. | |
MagNet [126] | WACV18 | Feature fusion approach to integrating contextual information. | ConvTranspose is more helpful than skip connections or context pooling. | The improvement is not obvious. |
Zhu et al. [127] | CVPR18 | EMO metric to get a high IoU. | The EMO score inspired several effective strategies for a new anchor design to obtain a higher facial IoU score. | |
TinaFace [128] | arXiv21 | Geometric transformations and multi-scale representation. | Simple improvements of RetinaNet achieve better performance. | |
Zhang et al. [132] | WACV20 | Hard example mining and super-resolution. | It handles the imbalance between images. |
Song et al. [133] proposed a topological line localization (TLL) network based on the pedestrian torso, which was designed to reduce the effects of small-scale pedestrian boundary blur, appearance blur and the bounding box annotation method, which brings too much noisy background to small objects. Combining TLL and ConvLSTM into a single time-aware architecture to aggregate the features of consecutive frames in the video further enhances the performance of small pedestrian detection. Furthermore, a Markov random field is employed as a post-processing strategy to deal with crowd occlusion. Das et al. [134] constructed the ISI pedestrian dataset, which includes 13,129 annotated video frames with 82,300 labeled pedestrians. Additionally, Das et al. provided a three-phase detection algorithm. First, the prospective regions in each frame are identified using a zone classifier, which uses an improved Inception network to lower the error. The frame rate is then significantly improved by using only the prospective regions to locate pedestrians. Finally, non-maximum suppression (NMS) is applied to remove redundant bounding boxes of the same pedestrian.
CNNs can not only learn low-level features but also have a strong ability to learn high-level semantic features. Therefore, CSP [135] simplifies pedestrian detection into pedestrian center and scale prediction tasks through convolutional operations. The detection head applies a convolutional operation to the feature map generated by the feature extractor and adds two parallel 1 × 1 convolutions to generate, respectively, a center point heat map and a scale prediction map. Cross-entropy loss is employed for center point prediction and L1 loss for scale prediction. Yu et al. [136] constructed the TinyPerson dataset, which focuses on persons at and around the seaside for rapid maritime rescue. Pedestrians in TinyPerson are much smaller than those in other datasets, with most under 20 pixels in size and with a wide variance in aspect ratio. To solve the problem that the distribution of the pre-training dataset differs greatly from the distribution of the dataset for the specified task, this work proposes a scale match to make the feature distribution consistent between the pre-trained dataset E and the task-specific dataset D, as shown in Eq (6), where $ {P}_{\left(s, D\right)} $ is defined as the probability density function of objects of size s in the dataset D, and T is the scale change function.
$ {P}_{\left(s, T\left(E\right)\right)}\approx {P}_{\left(s, D\right)} $ | (6) |
The FSAF [37] allows each instance to freely choose the best layer to optimize the network, instead of using the traditional pyramid, which puts several anchors of a fixed size at each level. The best feature layer for each instance is dynamically selected throughout the training phase based on the content of the instance, rather than just its size; the selection function is given by Eq (7):
$ {l}^{\text{'}} = \lfloor {l}_{0}+{log}_{2}\left(\frac{\sqrt{wh}}{224}\right)\rfloor $ | (7) |
where 224 is the ImageNet pre-training size and $ {l}_{0} $ is the initial feature layer. The strengths and weaknesses of small pedestrian detection methods are shown in Table 8.
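A minimal sketch of the size-based level mapping in Eq (7); the clamping to the available pyramid levels is an added assumption, and FSAF itself replaces this heuristic with loss-based online selection:

```python
import math

def select_level(w, h, l0=0, num_levels=5):
    """Size-based pyramid level assignment of Eq (7):
    l' = floor(l0 + log2(sqrt(w*h) / 224)), clamped to valid levels.
    224 is the canonical ImageNet training size."""
    l = math.floor(l0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(0, min(num_levels - 1, l))
```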
Method | Publication | Techniques | Strengths | Weaknesses |
Song et al. [133] | ECCV18 | A topological line detection. | TLL can automatically adapt to small-scale pedestrians. | No mitigation of information loss for small pedestrians. |
SaYwF [134] | arXiv19 | A three-phase detection model. | Achieves a trade-off between detection accuracy and detection speed. | |
CSP [135] | CVPR19 | Pedestrian detection is converted to high-level semantic feature prediction. | No additional post-processing is required for CSP. | Objects with a large variance in aspect ratio need to be examined. |
FSAF [37] | CVPR19 | Feature-selective anchor-free module. | Dynamically assigning each instance to the most suitable feature level is more robust. | Separate anchor-free branches do not have many advantages over anchor-based branches. |
Yu et al. [136] | WACV20 | Scale match of the pre-trained dataset to the task-specified dataset. | Scale match can better utilize the existing annotated data. | It has poor performance on TinyPerson. |
Object detection in aerial images is crucial in many real-world applications, including urban planning, emergency rescue [137], traffic detection [138,139], etc. Since aerial images are usually taken from high altitudes looking down, the rotation of objects varies greatly and is displayed in arbitrary directions. In addition, aerial images contain highly dense scenes and many small objects, making SOD a complex problem for aerial remote sensing images. Innovative detection algorithms have emerged to address these issues.
S2A-Net [140] contains a feature alignment module and an oriented detection module to keep the classification score consistent with the localization accuracy. SCRDet [141] designed supervised multidimensional attention to highlight small object regions and reduce the effect of background noise. Oriented R-CNN [142] and MRDet [143] both proposed a lightweight region proposal network to generate oriented proposals. The model of [144] contains four parts. The first component is the backbone, which extracts feature maps from the input images; it incorporates a ResNet50 network with deformable convolutional layers because a regular convolution cannot adapt to variations in viewpoint in images taken by drones. The second part uses an FPN to exploit and improve the feature maps obtained from ResNet50. The third component is the RPN, which extracts prospective object proposals from the image. The last part is a task head for specific goals, which assigns bounding box and mask prediction by using an interleaved cascade architecture. Yi et al. [145] extended the center key-point object detector for oriented object detection. A U-shaped network [146] is the foundation of the model. In the up-sampling process, skip connections are used to combine feature maps. The architecture outputs four maps: the heat map, offset map, box parameter map and orientation map. The heat map and offset map are used to infer the locations of the center points. After the center points are detected, the box boundary-aware vectors (BBAVectors) are regressed to capture the oriented bounding boxes.
According to Han et al. [147], CNNs lack rotation invariance, which means that, after an image is rotated, the features extracted from it will also change. ReDet was therefore proposed to give CNNs rotation invariance. The authors incorporate rotation-equivariant networks into the backbone to extract rotation-equivariant features, which allows for precise prediction of the orientation. Then, a rotation-invariant RoI Align module was developed based on RRoI Align [148] to align both the channel dimension and the spatial dimension and obtain rotation-invariant features. DarkNet-RI [149] uses DarkNet53 [7] as a backbone that contains a rotation-invariant layer to extract rotation-invariant multi-scale features, and it uses classification solutions to directly predict the locations of objects. After that, a box refinement module carries out additional NMS to eliminate overlapping and redundant bounding boxes. RepPoints [150] develops adaptive point sets and can capture the geometric structure of airborne objects with abrupt changes in direction in a cluttered environment. Three oriented conversion functions were presented by Li et al. [151] to transform adaptive points into oriented bounding boxes for various oriented objects. They apply MinAreaRect in post-processing to produce the usual rotated rectangle prediction, and the NearestGTCorner and MinAreaRect functions are applied to enhance adaptive point learning during training. Xu et al. [152] proposed the Dot Distance (DotD), a normalized Euclidean distance between the centroids of two bounding boxes, to solve the problem of the IoU being sensitive to minute offsets between bounding boxes when detecting tiny objects. S2ANET-SR [153] uses super-resolution to enhance the feature extraction of small objects in remote sensing images and incorporates perceptual loss and texture matching loss to train S2ANET-SR jointly with the detection loss.
The authors of [154] developed a cross-layer attention module to extract non-local features from small objects to enhance their features. The authors of [155] utilized a Gaussian mixture model to generate focal regions, as well as an incomplete box suppression method to mitigate the truncated box problem, which improved the performance of SOD. The strengths and weaknesses of aerial image methods are shown in Table 9.
Method | Publication | Techniques | Strengths | Weaknesses |
Zhang et al. [144] | ICCV19 | Model fusion, cascade network, deformable convolution and data augmentation. | The joint optimization of four strategies makes the model perform well on VisDrone. | The efficiency and detection speed of the network is poor. |
Yi et al. [145] | WACV21 | Oriented anchor-free object detector. | Extended BBAVector technique on CenterNet is simple and effective. | |
ReDet [147] | arXiv21 | Rotation-invariant feature representation. | Smaller models and better results for small- and medium-sized objects. | |
DarkNet-RI [149] | TGRS21 | Rotation-invariant multi-scale feature representation. | The multi-scale and rotation-invariant feature representation is robust against scale variance. | Need to enhance the overlapping and occluded object detection. |
Li et al. [151] | CVPR22 | Adaptive points learning approach. | This model can classify and localize objects with arbitrary orientation. | It requires large computing cost. |
DotD [152] | CVPRW21 | A new metric DotD. | It's valid for defining positive and negative anchors in training. |
This section provides an overview of the SOD datasets that are currently available. The performance of SOTA SOD approaches is also evaluated by using four large-scale datasets. We chose well-known image datasets: MS COCO for the general SOD evaluation, WIDER FACE for SOD tasks with small faces, TinyPerson for SOD tasks with small pedestrians and DOTA for SOD tasks with aerial images.
A high-quality dataset is important for developing advanced object detection algorithms. In recent years, many well-known datasets for object detection have been published, such as MS COCO [43] and VOC [45]. VOC is a dataset for the Pascal VOC challenge object detection subtask, which has two versions: VOC2007 and VOC2012. More than 27,000 object instance bounding boxes are labeled in 33,043 images in VOC2012. MS COCO is a sizable multi-task dataset: it has 91 object categories in all (80 of which are used for object detection tasks) and 2.5 million labeled instances in 328,000 images. The tasks on the COCO dataset are more challenging because, in contrast to VOC, COCO contains more small objects and more complicated backgrounds in the images. COCO also has a more balanced object distribution: less than 20% of its images contain only one category, with an average of 3.5 categories and 7.7 object instances per image, whereas over 70% of the images in the VOC dataset contain only one category, with an average of only 1.4 categories and 2.3 object instances per image. These benchmarks boost the development of detecting regular-sized objects. Unfortunately, the detection of small objects is still insufficient. This is caused both by the characteristics of small objects themselves and by the fewer benchmarks designed for SOD. To provide a comprehensive review, we investigated datasets containing a large number of small objects that span various SOD tasks, such as face detection, pedestrian detection, traffic sign/light detection and aerial image object detection, as shown in Tables 10–13.
Dataset | Year | Description |
WIDER FACE [47] | 2016 | WIDER FACE is a large-scale dataset of face images. Images are selected from the publicly available WIDER dataset. |
IJB [156] | 2015 | IJB-A/B/C is a dataset for face detection and recognition. IJB-A contains 1845 objects, 11,754 images, 55,026 video frames, 7011 videos and 10,044 non-facial images. |
DarkFace [157] | 2019 | The DarkFace dataset offers 6000 nighttime low-light photos from real-world locations, all labeled with bounding boxes of human faces. Additionally, this dataset has 9000 unlabeled low-light images taken in the same environment. |
UFDD [158] | 2018 | UFDD, an unconstrained face detection dataset, consists of more than 6000 images and 11,000 faces, and it contains seven scenes: rain, snow, haze, blur, illumination, lens impediments and distractors. |
WildestFaces [159] | 2018 | The WildestFaces dataset includes 67,889 pictures. Along with annotations for face detection and recognition, it also includes tags for blur severity, scale and occlusion. |
Dataset | Year | Description |
TinyPerson [136] | 2020 | TinyPerson is a challenging benchmark for tiny object detection in a complex context and at a long distance. A total of 72,651 labeled very small objects are included in the dataset. |
WiderPerson [160] | 2020 | The WiderPerson dataset contains 32,203 images with a total of 393,703 instances. |
EuroCity [161] | 2018 | The EuroCity person dataset was collected in several European countries by in-vehicle cameras; it includes about 47,300 images with more than 238,200 annotated instances of people. |
Citypersons [162] | 2017 | The Citypersons dataset is a subset of the Cityscapes dataset; it offers 5,000 images from 27 cities with fine-grained, pixel-level annotations. |
Caltech [163] | 2009 | Caltech is a challenging dataset that contains low-resolution, frequently obstructed objects. There are 192,000 and 155,000 pedestrian instances in the training and testing sets, respectively. |
Dataset | Year | Description |
DIOR [58] | 2020 | DIOR is made up of 20 common object categories, 23,463 optical remote sensing images and 192,472 hand-annotated object instances with axis-aligned bounding boxes. |
VisDrone [164] | 2022 | VisDrone was collected by the AISKYEYE team at Tianjin University in China while utilizing several UAVs; it includes pedestrians, automobiles, bicycles and other categories. |
UAVDT [165] | 2018 | UAVDT is a sizable UAV-based video dataset with 80,000 total frames that are intended for vehicle detection and tracking. |
DOTA [166] | 2018 | DOTA has three versions so far; DOTA-v1.0 includes 188,282 instances of 2806 aerial images in 15 main categories. |
NWPU VHR-10 [167] | 2016 | The NWPU VHR-10 dataset contains a total of 800 very high-resolution optical remote sensing images, which were acquired from Google Earth and Vaihingen. |
UCAS-AOD [168] | 2015 | The UCAS-AOD datasets include many small objects with intricate backgrounds with a total of 2420 images and 14,596 instances. |
Dataset | Year | Scenario | Description |
SOD [42] | 2021 | Generic | SOD is a subset of the SUN [171] and MS COCO datasets. Ten types of objects that appear extremely small in the images were manually chosen by the authors. |
TT100K [56] | 2016 | Traffic Sign | TT100K has 100,000 images and 30,000 traffic sign instances across 128 classes. |
DeepScores [169] | 2018 | Music Score | DeepScores includes high-quality images of sheet music, with around 100 million small objects. |
KITTI [170] | 2012 | Traffic Scene | KITTI has up to 15 vehicles and 30 pedestrians in each image captured in Karlsruhe, Germany. |
Frames per second refers to the speed of object detection, and it indicates the number of images that can be processed within each second. A higher value implies that the method is faster and can potentially be applied to real-time SOD.
IoU measures the similarity between the areas of the predicted bounding box ($ {bbox}_{pred} $) and the ground truth bounding box ($ {bbox}_{GT} $). The IoU function is given by Eq (8).
$ IoU = \frac{Area\left({bbox}_{pred}\cap {bbox}_{GT}\right)}{Area\left({bbox}_{pred}\cup {bbox}_{GT}\right)} $ | (8) |
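Eq (8) translates directly to code; a sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box1, box2):
    """Intersection over Union of Eq (8) for axis-aligned boxes
    given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
```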
AP is a common metric for object detection tasks, and the following definitions are used in the AP calculation.
1) Positive sample: a predicted bounding box whose confidence score is above the set threshold.
2) Negative sample: a predicted bounding box whose confidence score is below the set threshold.
3) True positive (TP): a positive sample that correctly matches a ground-truth object.
4) True negative (TN): a negative sample that correctly corresponds to the background.
5) False positive (FP): a positive sample that does not match any ground-truth object.
6) False negative (FN): a ground-truth object that the detector misses.
In the VOC dataset, the IoU threshold is typically set to 0.5. Positive samples with IoU values higher than 0.5 are labeled as TP, and positive samples with IoU values lower than 0.5 are labeled as FP. FN indicates the number of ground-truth objects that are not found. The precision rate and recall rate are then given by Eqs (9) and (10). AP is calculated across different recalls. Specifically, for a given recall value r, the precision is taken as the maximum precision over all recalls greater than or equal to r. The area under the resulting precision-recall (P-R) curve is the AP value, and the mAP is the mean AP across all categories. AP and mAP are given by Eqs (11) and (12).
$ precision = \frac{TP}{TP+FP} = \frac{TP}{\text{all predicted boxes}} $ | (9) |
$ recall = \frac{TP}{TP+FN} = \frac{TP}{\text{all ground truths}} $ | (10) |
$ AP = {\int }_{0}^{1}P\left(R\right)dR $ | (11) |
$ mAP = \frac{\sum AP}{N} $ | (12) |
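The interpolation rule and the integral of Eq (11) can be sketched as follows, assuming the inputs are cumulative precision/recall pairs sorted by increasing recall:

```python
def average_precision(recalls, precisions):
    """Area under the interpolated P-R curve (Eq (11)): precision at
    recall r is replaced by the maximum precision at any recall >= r,
    then the area is accumulated over recall increments."""
    # Pad the curve so it spans recall 0..1.
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Backward pass builds the monotone precision envelope.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Riemann sum over recall increments.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap
```

A detector with perfect precision at every recall yields an AP of 1.0; lower or earlier-dropping precision shrinks the enclosed area.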
The stricter COCO evaluation metric is more widely used than the PASCAL VOC evaluation metric. Its IoU thresholds typically range from 0.5 to 0.95 with a step size of 0.05. In MS COCO, the AP is also calculated separately for small ($ area < {32}^{2} $), medium ($ {32}^{2} < area < {96}^{2} $) and large ($ area > {96}^{2} $) objects.
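The size buckets can be expressed as a small helper (a sketch, not the official COCO API):

```python
def coco_size_bucket(area):
    """COCO object-size buckets used for AP_s / AP_m / AP_l,
    with area measured in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```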
Table 14 shows the performance evaluation results for generic SOD algorithms applied to the COCO dataset; note that AP has the same meaning as mAP. AP50 and AP75 denote the AP when the IoU threshold is set to 0.5 or 0.75, respectively, while APs, APm and APl denote the average precision for small, medium and large objects, respectively. As shown, IENet [179] achieves the best AP (51.2). In general, the detection performance for large objects is much higher than that for other sizes. HRDNet [78] achieves a value of 32.1 for small objects, whereas CenterNet++ [118] achieves a value of 27.8. These results show that increasing the resolution of the input feature with multi-scale training can yield better performance on small objects. All experiments were conducted on a Linux system with an NVIDIA GeForce RTX 2080Ti and CUDA 11.7.
Model | Year | Backbone | AP | AP50 | AP75 | APs | APm | APl | FPS |
Faster R-CNN [4] | 2015 | R101-FPN | 36.5 | 58.3 | 39.3 | 18.4 | 40.6 | 50.6 | 6 |
Mask R-CNN [5] | 2017 | R101‑FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 | 8.6 |
YOLOv7-tiny [8] | 2022 | 38.7 | 286 | ||||||
YOLOv7-E6E [8] | 2022 | 56.8 | 74.4 | 62.1 | 36 | ||||
FPN [35] | 2017 | R101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 6 |
SSD [36] | 2015 | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 | 28 |
CornerNet* [38] | 2018 | Hourglass | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9 | 4.1 |
FCOS [39] | 2019 | R101-FPN | 41.8 | 60.3 | 45.3 | 25.6 | 47.7 | 56.1 | 7 |
Efficientdet [71] | 2020 | Efficientdet | 33.8 | 52.2 | 35.8 | 12.0 | 38.3 | 51.2 | 98 |
RetinaNet [74] | 2017 | R101-FPN | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | 13.6 |
FSSD [75] | 2017 | VGG16 | 31.8 | 52.8 | 33.5 | 14.2 | 35.1 | 45.0 | 65 |
HRDNet* [78] | 2021 | R101+152 | 47.4 | 66.9 | 51.8 | 32.1 | 50.5 | 55.8 | 2.8 |
RHF-Net [79] | 2020 | ResNet-101 | 37.7 | 59.8 | 40.1 | 19.9 | 42.9 | 51.5 | 29.1 |
QueryDet [81] | 2021 | R50-FPN | 38.2 | 58.6 | 40.9 | 23.7 | 42.0 | 49.5 | 13.6 |
SNIP [95] | 2018 | DPN [174] | 45.7 | 67.3 | 51.1 | 29.3 | 48.8 | 57.1 | 5 |
SNIPER [96] | 2018 | ResNet101 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 | 5 |
FR-FDWT [99] | 2019 | ResNet-101 | 42.1 | 63.4 | 45.7 | 21.8 | 45.1 | 57.1 | 7 |
ION [101] | 2016 | VGG16 | 24.6 | 46.3 | 23.3 | 7.4 | 26.2 | 38.8 | 1.3 |
DSSD [102] | 2017 | ResNet101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 6.4 |
FRCNN-DST [113] | 2021 | R101-FPN | 40.1 | 59.3 | 43.2 | 25.6 | 43.9 | 50.9 | 9 |
Retina-DST [113] | 2021 | R101-FPN | 41.3 | 59.9 | 43.8 | 25.4 | 45.1 | 54.0 | 13.6 |
FCOS-DST [113] | 2021 | R101-FPN | 41.6 | 60.0 | 44.6 | 26.5 | 45.4 | 53.1 | 7 |
PPDet [117] | 2020 | R101-FPN | 39.6 | 58.0 | 43.4 | 23.9 | 44.1 | 51.0 | 7.5 |
CenterNet++ [118] | 2022 | ResNet-101 | 47.7 | 65.1 | 51.9 | 27.8 | 50.5 | 60.6 | 104 |
DCN* [125] | 2017 | AlignedIncR | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 | 7 |
RefineDet [172] | 2018 | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 | 24 |
D2Det [173] | 2020 | R101-FPN | 45.4 | 64.0 | 49.5 | 25.8 | 48.7 | 58.1 | 4 |
CoupleNet [175] | 2017 | ResNet101 | 33.1 | 53.5 | 35.4 | 11.6 | 36.3 | 50.1 | 8.2 |
Regionlets [176] | 2018 | ResNet-101 | 39.3 | 59.8 | – | 21.7 | 43.7 | 50.9 | – |
FitnessNMS [177] | 2018 | ResNet-101 | 41.8 | 60.9 | 44.9 | 21.5 | 45.0 | 57.5 | – |
PPYOLOE [178] | 2022 | CSPRepRes | 43.1 | 60.5 | 46.6 | 23.2 | 45.2 | 56.9 | 208 |
IENet [180] | 2021 | ResNet-101 | 51.2 | 69.3 | 56.1 | 34.5 | 53.8 | 63.6 | 3 |
In Table 15, we evaluate small face detection methods on WIDERFACE [47]. WIDERFACE defines three levels of difficulty ('easy', 'medium' and 'hard') based on the detection rate of EdgeBox [180]. As shown, TinaFace [128] achieves the best AP; the AP values for the easy, medium and hard test sets are 96.3, 95.7 and 92.1, respectively. IENet [180] also achieves relatively good results; its AP values for the easy, medium and hard test sets are 96.1, 94.7 and 89.6, respectively. TinaFace and IENet both increase the resolution of the prediction feature map, which fully utilizes the fused feature map, and IENet additionally incorporates contextual information. This suggests that boosting the resolution of the prediction feature map and incorporating contextual information may be the key to enhancing small face detection.
Method | Year | Backbone | AP | ||
Easy | Medium | Hard | |||
Faster R-CNN [4] | 2015 | ResNet50 | 84.0 | 72.4 | 34.7 |
RetinaNet [74] | 2017 | ResNet50 | 94.8 | 93.8 | 89.6 |
S3FD [124] | 2017 | VGG16 | 93.4 | 92.7 | 85.4 |
TFD with GAN [125] | 2018 | VGG16 | 93.2 | 92.2 | 85.8 |
Face-MagNet [126] | 2018 | ResNet101 | 92.5 | 91.4 | 83.1 |
TinaFace [128] | 2020 | ResNet50 | 96.3 | 95.7 | 92.1 |
Small Hard Face [132] | 2020 | VGG16 | 95.0 | 93.8 | 88.5 |
IENet [180] | 2021 | ResNet50 | 96.1 | 94.7 | 89.6 |
RetinaNet | 2019 | Mobilenet [181] | 87.9 | 80.7 | 40.3 |
PyramidBox [182] | 2018 | ResNet50 | 95.5 | 94.6 | 88.8 |
RetinaFace [183] | 2019 | ResNet50 | 88.6 | 87.0 | 80.1 |
Table 16 shows typical small pedestrian SOD methods on the TinyPerson [136] dataset, where MR [184] denotes the miss rate (lower is better). The size ranges are indicated by the superscripts of MR and AP, where tiny denotes the size range (2, 20) and small denotes the size range (20, 32); the IoU thresholds utilized for the evaluation are indicated by the subscripts. Among these algorithms, FCOS [39] yields the worst (highest) miss rates across all MR evaluations. With an IoU of 0.5, FPN produced the best AP for tiny and small objects, whereas Grid R-CNN [185] did so for tiny objects with IoU values of 0.25 and 0.75.
Method | Year | $ M{R}_{50}^{tiny} $ | $ M{R}_{50}^{small} $ | $ M{R}_{25}^{tiny} $ | $ M{R}_{75}^{tiny} $ | $ {AP}_{50}^{tiny} $ | $ {AP}_{50}^{small} $ | $ {AP}_{25}^{tiny} $ | $ {AP}_{75}^{tiny} $ |
Faster R-CNN [4] | 2015 | 87.78 | 71.31 | 77.35 | 98.4 | 43.55 | 56.69 | 64.07 | 5.35 |
FPN [36] | 2017 | 87.57 | 72.56 | 76.59 | 98.39 | 47.35 | 63.18 | 68.43 | 5.83 |
FCOS [39] | 2019 | 96.12 | 84.14 | 89.56 | 99.56 | 17.9 | 35.75 | 40.49 | 1.45 |
RetinaNet [74] | 2017 | 92.66 | 82.84 | 81.95 | 99.13 | 33.53 | 48.26 | 61.51 | 2.28 |
Grid R-CNN [185] | 2018 | 87.96 | 73.16 | 78.27 | 98.21 | 47.14 | 62.48 | 68.89 | 6.38 |
DSFD [186] | 2019 | 93.47 | 78.72 | 78.02 | 99.48 | 31.15 | 51.64 | 59.58 | 1.99 |
FreeAnchor [187] | 2022 | 88.97 | 73.67 | 77.62 | 98.7 | 41.36 | 53.36 | 63.73 | 4.00 |
Li-RCNN [188] | 2019 | 89.22 | 74.86 | 82.44 | 98.78 | 44.68 | 62.65 | 64.77 | 6.26 |
In Table 17, we compare the performance of state-of-the-art aerial image object detection algorithms on DOTA-v1.0 [166], which consists of 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground field track (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). ReDet and Oriented R-CNN achieve the best mAP value of 76.3. The best AP in each category is marked in bold.
Method | Year | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
Faster R-CNN-O [4] | 2015 | 88.4 | 73.1 | 44.9 | 59.1 | 73.3 | 71.5 | 77.1 | 90.8 | 78.9 | 83.9 | 48.6 | 63.0 | 62.2 | 64.9 | 56.2 | 69.1 |
Mask R-CNN [5] | 2020 | 76.8 | 73.5 | 49.9 | 57.8 | 51.3 | 71.3 | 79.7 | 90.4 | 75.1 | 67.3 | 48.5 | 70.6 | 64.8 | 64.5 | 55.9 | 63.4 |
CenterNet-O [40] | 2019 | 81.0 | 64.0 | 22.6 | 56.6 | 38.6 | 64.0 | 64.9 | 90.8 | 78.0 | 72.5 | 44.0 | 41.1 | 55.5 | 55.0 | 57.4 | 59.1 |
RetinaNet-O [74] | 2017 | 88.6 | 77.6 | 42.1 | 58.1 | 74.5 | 71.6 | 79.1 | 90.8 | 82.1 | 74.3 | 54.7 | 60.6 | 62.5 | 69.5 | 60.4 | 68.2 |
S2A-Net [140] | 2019 | 89.1 | 82.8 | 48.3 | 71.1 | 78.1 | 78.3 | 87.2 | 90.8 | 84.9 | 85.6 | 60.3 | 62.6 | 65.2 | 69.1 | 57.9 | 74.1 |
SCRDet [141] | 2019 | 89.9 | 80.7 | 52.1 | 68.4 | 68.4 | 60.3 | 72.4 | 90.9 | 87.9 | 86.9 | 65.0 | 66.7 | 66.3 | 68.2 | 65.2 | 72.6 |
Oriented R-CNN [142] | 2019 | 88.9 | 83.5 | 55.3 | 76.9 | 74.3 | 82.1 | 87.5 | 90.9 | 85.6 | 85.3 | 65.5 | 66.8 | 74.4 | 70.2 | 57.3 | 76.3 |
MRDet [143] | 2019 | 89.5 | 84.0 | 55.4 | 66.7 | 76.3 | 82.1 | 87.9 | 90.8 | 86.9 | 85.0 | 52.3 | 66.0 | 76.2 | 76.8 | 67.5 | 76.2 |
BBAVectors [145] | 2021 | 88.4 | 80.0 | 50.7 | 62.2 | 78.4 | 79.0 | 87.9 | 90.9 | 83.6 | 84.4 | 54.1 | 60.2 | 65.2 | 64.3 | 55.7 | 72.3 |
ReDet [147] | 2021 | 88.8 | 82.6 | 54.0 | 74.0 | 78.1 | 84.1 | 88.0 | 90.9 | 87.8 | 85.8 | 61.8 | 60.4 | 76.0 | 68.1 | 63.6 | 76.3 |
ROI-Trans [148] | 2019 | 88.6 | 78.5 | 43.4 | 75.9 | 68.8 | 73.7 | 83.6 | 90.7 | 77.3 | 81.5 | 58.4 | 53.5 | 62.8 | 58.9 | 47.7 | 69.6 |
RepPoints-O [151] | 2021 | 87.0 | 83.2 | 54.1 | 71.2 | 80.2 | 78.4 | 87.3 | 90.9 | 86.0 | 86.3 | 59.9 | 70.5 | 73.5 | 72.3 | 59.0 | 76.0 |
CAD-Net [189] | 2019 | 87.8 | 82.4 | 49.4 | 73.5 | 71.1 | 63.5 | 76.6 | 90.9 | 79.2 | 73.3 | 48.4 | 60.9 | 62.0 | 67.0 | 62.2 | 69.9 |
Based on the experimental results, we further discuss some limitations of existing SOD methods as follows.
1) The frameworks used for SOD are generally adapted from popular models like Faster R-CNN, SSD and YOLO; these architectures may not be suitable for small objects, leading to poor performance.
2) Using super-resolution to enhance the resolution of a small object can improve the precision of SOD, but the detection speed will be significantly lower and unable to fulfill the demands of real-world scenarios like real-time monitoring.
3) Transformers have been widely applied in the computer vision field, like DETR [190] in object detection. However, there has not been much research on using transformers for SOD.
4) CNNs are not inherently robust to scale changes, so there is a need to design feature extractors that are more scale-aware.
5) MS COCO may not be an ideal benchmark for small objects because small objects account for a relatively small percentage of the dataset.
In addition to the common challenges in object detection, such as continual object detection and imbalance problems, there are challenges specific to SOD, including noisy feature representation, small object information loss, the effect of the receptive field, sensitivity to location variation and the scarcity of small object datasets.
Feature representation with noise. After passing through a CNN, the features of small objects are often contaminated by background noise, making it difficult for the network to capture the discriminative information that is pivotal for the localization and classification tasks. Moreover, small objects are often occluded and clustered, so it is particularly difficult to distinguish small objects from noisy clutter and precisely locate their boundaries.
Small object information loss. Because each small object occupies so few pixels, its features are virtually eliminated after the down-sampling operations in deeper neural networks. This loss of already weak small object information is fatal to SOD, because it is hard for the detection head to give accurate predictions when the small-object signal has been washed out of the high-level representations.
Effect of the receptive field. Large receptive fields are typically chosen by deep neural networks to prevent the loss of information. However, the receptive field of the low-resolution prediction feature map may not match the size of small objects. If the receptive field is much larger than a small object, the object is effectively treated as background and few of its features are extracted by the backbone network, resulting in poor SOD performance.
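This mismatch can be checked analytically: the receptive field of a stack of convolution or pooling layers grows at each layer by (kernel − 1) times the cumulative stride. The sketch below uses the standard recurrence on a hypothetical three-layer stack, not any specific backbone.

```python
def receptive_field(layers):
    """Analytic receptive field of a stack of conv/pool layers.

    Each layer is (kernel_size, stride); the receptive field r grows
    by (k - 1) * jump, where jump is the cumulative stride so far.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Hypothetical stack: three 3x3 stride-2 convolutions.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # → 15
```

Even this shallow toy stack already covers a 15-pixel window, close to the full extent of many "tiny" objects, so much deeper backbones quickly exceed them by a wide margin.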
Location variation sensitivity. A small location deviation of the bounding box produces a far larger disturbance in IoU-based metrics for small objects than for larger objects, which makes it difficult to find a suitable IoU threshold and deliver high-quality positive and negative samples to train the networks.
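A small numeric example illustrates this sensitivity: shifting a 16 × 16 box and a 128 × 128 box by the same 6 pixels leaves the large box comfortably above the usual 0.5 IoU threshold while the small box falls below it (the boxes are invented for illustration).

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# Shift each prediction by the same 6 pixels along x.
small_gt, small_pred = (0, 0, 16, 16), (6, 0, 22, 16)
large_gt, large_pred = (0, 0, 128, 128), (6, 0, 134, 128)

print(round(iou(small_gt, small_pred), 3))  # → 0.455 (below 0.5: now a FP)
print(round(iou(large_gt, large_pred), 3))  # → 0.910 (still a clear TP)
```

The same 6-pixel localization error thus flips the label of the small object while barely affecting the large one, which is exactly what destabilizes threshold-based sample assignment.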
Scarcity of small object datasets. Large-scale general-purpose small object datasets remain scarce, partly because of the high cost of annotating small objects. Although MS COCO contains a reasonably large proportion of small objects (31.62%), each image contains too many instances, which leads to an uneven distribution of small objects.
According to the challenges of SOD and the analysis of performance results, we discuss several potential directions for future research in SOD:
1) Weakly supervised, unsupervised and self-supervised SOD. Existing deep learning-based SOD techniques use a fully supervised paradigm, which requires a sizable number of images with bounding-box annotations for model training. However, this annotation work is labor-intensive and time-consuming. Weakly supervised object detection can use image-level labels (such as image categories) as supervisory signals to train object localization models without pixel-level annotation, which lessens the annotation workload. Unsupervised salient object detection [191] and self-supervised learning tasks [192] based on contrastive learning have become hot research topics in the past two years. Therefore, it is crucial to continue researching the development of weakly supervised learning-based SOD algorithms.
2) Suitable metrics for SOD. IoU-based metrics, including the original IoU and its extensions (DIoU, GIoU, etc.), are extremely sensitive to the position deviation of small objects and significantly reduce the detection performance when utilized in anchor-based detectors. The authors of [119] proposed a new Wasserstein distance-based SOD metric, which outperformed the standard fine-tuning baseline by 6.7 AP and the state-of-the-art model by 6.0 AP. Therefore, designing a suitable metric for small objects will be crucial to further research.
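A simplified sketch of the idea behind the Wasserstein distance-based metric of [119]: each box (cx, cy, w, h) is modeled as a 2D Gaussian, the Wasserstein distance between the two Gaussians has a closed form, and the result is mapped into (0, 1] by an exponential. The normalization constant c is dataset-dependent; the value below is a placeholder, and this sketch deliberately omits details of the original formulation.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between boxes (cx, cy, w, h).

    Each box is modeled as a 2D Gaussian N((cx, cy), diag(w^2/4, h^2/4));
    for diagonal covariances the squared Wasserstein distance reduces to
    the sum of squared differences of centers and half-sizes. The constant
    c is a dataset-dependent placeholder, not a prescribed value.
    """
    w2 = ((box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
          + ((box_a[2] - box_b[2]) / 2) ** 2
          + ((box_a[3] - box_b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2) / c)

# Identical boxes score 1.0; the score decays smoothly with distance.
print(nwd((0, 0, 16, 16), (0, 0, 16, 16)))  # → 1.0
```

Unlike IoU, this score decays smoothly with center distance and remains nonzero even when the boxes do not overlap at all, which makes label assignment for tiny objects far less brittle.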
3) Multi-task joint optimization. Even though techniques like scale-aware training strategies, incorporating contextual information, data augmentation and increasing the resolution of input features help to improve SOD performance, they are still far from adequate, and the combined use of these methods may be able to further improve SOD performance.
4) Open world or few-shot SOD. Few-shot object detection [193] has produced prominent achievements, and SOD in the few-shot scenario is also in urgent need of solutions. Open world SOD seeks to overcome the SOD conundrum while enabling incremental learning in the model, and this type of issue will be a significant research topic in the future.
An in-depth review of state-of-the-art deep learning-based SOD algorithms is provided in this paper. We focus on SOD optimization approaches that aim to address the challenges of SOD, including scale-aware training, contextual information incorporation, data augmentation and boosting the resolution of input features. We have summarized the strengths and limitations of these approaches. We have also reviewed methods for crucial SOD tasks, including tiny face detection, tiny pedestrian detection and aerial image object detection. Additionally, detailed experiments were carried out to evaluate the performance of generic SOD algorithms, as well as methods for crucial SOD tasks; we found that boosting the resolution of input features is the most efficient way to improve SOD performance. Finally, we have presented four potential future directions for SOD.
This work was supported by the National Natural Science Foundation of China (No. 61876186) and Xuzhou Science and Technology Project (KC21300).
The authors declare no conflict of interest.
Title | Publication | Strengths | Limitations |
Object detection with deep learning: A review [57] | TNNLS 2019 | It reviews the deep learning-based object detection models and the difficulties of SOD. | These reviews offer a thorough summary of object detection. However, they concentrate on regular-sized object detection rather than small objects. |
Object detection in optical remote sensing images: A survey and a new benchmark [58] | ISPRS 2020 | It constructs DIOR, a large dataset of remote sensing. | |
Imbalance problems in object detection: A review [59] | arXiv 2020 | It reviews the imbalance problem of object detection. | |
Continual object detection: A review of definitions, strategies, and challenges [60] | arXiv 2022 | This survey investigates continual object detection. | |
New generation deep learning for video object detection: A survey [61] | TNNLS 2022 | It systematizes the latest video object detection models and analyzes the performance of these models on two datasets. | |
A survey of deep learning-based object detection [62] | IEEE Access 2022 | It reviews detection methods, general datasets and typical applications. | |
A survey of the four pillars for small object detection: Multi-scale representation, contextual information, super-resolution, and region proposal [64] | TSMC 2020 | It discusses the four pillars of SOD and reports on the performance of SOD on three datasets. | These studies do not contain a complete assessment of the most recent SOD approaches. |
Recent advances in small object detection based on deep learning: A review [65] | IVC 2020 | It reviews the SOD from five perspectives and analyzes the evaluation results for two general datasets. | |
A survey and performance evaluation of deep learning methods for small object detection [66] | ESWA 2021 | The solutions are summarized for the four challenges of SOD and some experiment analyses are provided. | It only analyzes the performance of three classical object detection algorithms (Faster R-CNN, SSD, YOLO). |
Deep learning-based detection from the perspective of small or tiny objects: A survey [67] | IVC 2022 | Aims to discuss small- or tiny-object datasets, detection techniques and the performance of these techniques. | These surveys systematically reviewed the development of SOD. Nevertheless, they all lack a comprehensive review of techniques deliberately designed for critical SOD tasks. |
A guide to image and video based small object detection using deep learning: Case study of maritime surveillance [68] | arXiv 2022 | Reviews the SOD methods and investigates the performance of SOD in maritime environments. | |
Towards large-scale small object detection: Survey and benchmarks [69] | arXiv 2022 | It presents a detailed study of SOD and yields two large-scale benchmarks for a driving scenario and aerial scene. | |
Deep learning based small object detection: A survey | Ours | We comprehensively discuss the definition of small objects, the challenges encountered in detecting small objects, the strengths and weaknesses of generic SOD algorithms and three crucial SOD tasks. We also analyze the performance of SOD on three datasets and summarize meaningful conclusions. |
Method | Publication | Techniques | Strengths | Weaknesses |
SSD [36] | ECCV16 | Pyramidal feature hierarchy without fusing features. | SSD can detect objects of various sizes. | The low-level prediction feature map has no strong semantics. |
FPN [35] | CVPR17 | Feature pyramid network (including feature fusion and multi-scaled fusion modules, etc.). | FPN dramatically improves the detection accuracy of small objects. | The feature representation capability will be diminished by the semantic gap between feature layers of various scales. |
RetinaNet [74] | ICCV17 | RetinaNet alleviates the foreground-background class imbalance problem. | ||
FSSD [75] | arXiv17 | Lightweight feature fusion module. | ||
MDSSD [76] | arXiv18 | MDSSD incorporates contextual information that is more conducive to SOD. | Lower detection speed than SSD. | |
[77] | CVPR21 | FPN with a learnable fusion factor. | The fusion factor further improves FPN performance for small objects. | |
HRDNet [78] | arXiv20 | FPN with image pyramid. | HRDNet acquires more details for small objects with high resolution. | Large numbers of parameters. |
IPG-Net [79] | CVPR20 | IPG-Net alleviates the vanishing of small object features. | This method is inefficient. |
RHF-Net [80] | CVPR20 | Recursive hybrid fusion pyramid network. | Low computational cost and high accuracy. | |
QueryDet [81] | CVPR22 | Query mechanism. | Accelerating inference with sparse query. | |
EFPN [82] | TMM22 | Super-resolution (including a super-resolution layer and enhancing representations of small objects to be similar to those of large ones). | EFPN adds a high-resolution layer to FPN to increase the accuracy of SOD. | Super-resolution feature extraction leads to more computational costs. |
[83] | CVPR17 | The GAN-based approaches effectively enhance the level of detail of information on small objects. | ||
MTGAN [84] | ECCV18 | |||
TPS [85] | ICCV19 | |||
MRAE [86] | CVPR22 | FPN with attention weight. | It provides a practical solution for multi-resolution feature extraction without using a GAN, and it is time-efficient. |
Method | Publication | Techniques | Strengths | Weaknesses |
SNIP [95], SNIPER [96] | CVPR18 | Scale normalization training strategy for image pyramids. | SNIP and SNIPER can effectively improve the detection performance of small objects. | It requires an input image pyramid that brings a high computational cost. |
SAN [97] | CVPR18 | Scale-aware training. | SAN makes the network more robust against scale invariance. | |
Trident [98] | ICCV19 | Multi-branch architecture and scale-aware training. | Multi-branch technique makes the receptive field size align with the object size. | It may bring about the over-fitting problem in each branch, as caused by too few effective samples. |
POD [99] | ICCV19 | Global scale learning. | This method makes the network more sensitive to scale invariance. |
Method | Publication | Techniques | Strengths | Weaknesses |
ION [101] | CVPR16 | Integrate contextual information. | ION exploits context and multi-scale representations to improve SOD. | Underutilization of early feature layers. |
DSSD [102], CSSD [103] | arXiv17, WACV18 | Fusing contextual information in different ways to improve SOD performance. | Slower detection speed than SSD. |
SMN [104] | ICCV17 | Spatial memory for contextual reasoning. | SMN models the instance-level context to improve the performance of SOD. | The gradient will vanish as the reasoning signal and perceptual signal cancel each other out. |
IRR [105] | arXiv20 | Contextual reasoning integrates intrinsic relations. | IRR updates the initial regional features to boost SOD. | Small objects are associated with difficulty in extracting semantic features. |
FA-SSD [106] | ICAIIC21 | Context with attention. | FA-SSD is more accurate than SSD. | It is less accurate than DSSD. |
Method | Publication | Techniques | Strengths | Weaknesses |
Kisantal et al. [110] | arXiv19 | Oversampling and random copy-pasting. | This approach achieves better object detection accuracy for small objects. | Random copying and pasting may cause background mismatch. |
RRNet [111] | ICCV19 | Adaptive resampling augmentation strategy. | Free-anchor and adaptive resampling result in excellent performance for very small objects. | |
Ünel et al. [112] | CVPR19 | Tiling-based augmentation. | The method provides a good trade-off between accuracy and time cost. | |
DST [113] | arXiv20 | Uses the feedback information to guide data preparation. | Feedback-driven and dynamic data preparation paradigms mitigate the scale-invariant issue. | |
Zoph et al. [114] | ECCV20 | Automatic data augmentation | This approach has no additional inference cost and minimal training cost. | The strategy is intricate. |
Chen et al. [116] | CVPR21 | It can be transferred to other datasets and tasks and is scale-sensitive. | The high time cost of auto-augmentation approaches for searching. |
Method | Publication | Techniques | Strengths | Weaknesses |
PPDet [117] | BMVC20 | Anchor-free with a new label strategy. | It reduces the contributions of non-discriminatory features during training. |
CenterNet++ [118] | CVPR21 | An anchor-free detector that uses triplet key points to represent objects. | This model with multi-resolution performs better. | |
NWD [119] | arXiv21 | A new metric to replace IoU. | The new metric is more effective than IoU for small object detection. | |
RFLA [120] | ECCV22 | Gaussian receptive field-based label assignment. | | |
C3Det [121] | CVPR22 | Annotation framework for tiny objects. | It alleviates the expense of tiny-object annotation. | |
SAHI [122] | arXiv22 | Slicing-aided inference. | This scheme is plug-and-play, does not require pre-training and improves the accuracy of detecting small objects. | Larger feature maps require more memory and computing cost. |
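The NWD metric [119] is compact enough to state exactly: each box (cx, cy, w, h) is modeled as a 2-D Gaussian N([cx, cy], diag(w²/4, h²/4)), and the second-order Wasserstein distance between two such Gaussians has a closed form. A minimal sketch; the normalizing constant `c` is dataset-dependent (the paper ties it to the average absolute object size), so here it is just a parameter:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two boxes (cx, cy, w, h),
    each modeled as a Gaussian N([cx, cy], diag(w^2/4, h^2/4))."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Closed-form squared 2nd-order Wasserstein distance between the Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + (wa / 2 - wb / 2) ** 2 + (ha / 2 - hb / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)
```

Identical boxes give a score of 1, and the score decays smoothly with center offset, so it stays informative even when tiny boxes no longer overlap at all, which is exactly where IoU collapses to 0.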
Method | Publication | Techniques | Strengths | Weaknesses |
Hu and Ramanan [123] | CVPR17 | Super-resolution by GAN. | The joint super-resolution and refinement model is effective. | No fusion of contextual information. |
S3FD [124] | ICCV17 | Scale-invariant strategy. | Three tricks (a scale-equitable framework, scale-compensation anchor matching and max-out background labeling) achieve superior performance. | |
MagNet [126] | WACV18 | Feature fusion approach to integrating contextual information. | ConvTranspose is more helpful than skip connections or context pooling. | The improvement is not obvious. |
Zhu et al. [127] | CVPR18 | EMO metric to get a high IoU. | The EMO score inspired several effective strategies for a new anchor design to obtain a higher facial IoU score. | |
TinaFace [128] | arXiv21 | Geometric transformations and multi-scale representation. | Simple improvements of RetinaNet achieve better performance. | |
Zhang et al. [132] | WACV20 | Hard example mining and super-resolution. | It handles the imbalance between images. | |
Method | Publication | Techniques | Strengths | Weaknesses |
Song et al. [133] | ECCV18 | A topological line detection. | TLL can automatically adapt to small-scale pedestrians. | No mitigation of information loss for small pedestrians. |
SaYwF [134] | arXiv19 | A three-phase detection model. | Achieves a trade-off between detection accuracy and detection speed. | |
CSP [135] | CVPR19 | Pedestrian detection is converted to high-level semantic feature prediction. | No additional post-processing is required for CSP. | Objects with a large variance in aspect ratio need to be examined. |
FSAF [37] | CVPR19 | Feature-selective anchor-free module. | Dynamically assigning each instance to the most suitable feature level is more robust. | Separate anchor-free branches do not have many advantages over anchor-based branches. |
Yu et al. [136] | WACV20 | Scale match of the pre-trained dataset to the task-specified dataset. | Scale match can better utilize the existing annotated data. | It has poor performance on TinyPerson. |
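Scale Match [136] rescales each pretraining image so that its object-size distribution matches that of the target dataset. A heavily simplified sketch, assuming sizes are measured as sqrt(w·h) and the target distribution is represented by a plain list of sampled sizes (the function name is ours):

```python
import random

def scale_match_factor(image_obj_sizes, target_sizes, rng=random.Random(0)):
    """Resize factor mapping this image's mean object size to a size
    drawn from the target dataset's size distribution."""
    mean_size = sum(image_obj_sizes) / len(image_obj_sizes)
    target = rng.choice(target_sizes)  # sample from the target distribution
    return target / mean_size
```

For example, a COCO image whose objects average 100 px would be shrunk by roughly 0.2x to match a TinyPerson-like 20 px average.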
Method | Publication | Techniques | Strengths | Weaknesses |
Zhang et al. [144] | ICCV19 | Model fusion, cascade network, deformable convolution and data augmentation. | The joint optimization of four strategies makes the model perform well on VisDrone. | The efficiency and detection speed of the network are poor. |
Yi et al. [145] | WACV21 | Oriented anchor-free object detector. | Extended BBAVector technique on CenterNet is simple and effective. | |
ReDet [147] | arXiv21 | Rotation-invariant feature representation. | Smaller models and better results for small- and medium-sized objects. | |
DarkNet-RI [149] | TGRS21 | Multi-scale and rotation-invariant feature representation. | The representation is robust against scale variance. | Overlapping and occluded object detection needs enhancement. |
Li et al. [151] | CVPR22 | Adaptive points learning approach. | This model can classify and localize objects with arbitrary orientation. | It requires large computing cost. |
DotD [152] | CVPRW21 | A new metric, DotD. | It is effective for defining positive and negative anchors in training. | |
Dataset | Year | Description |
WIDER FACE [47] | 2016 | WIDER FACE is a large-scale dataset of face images. Images are selected from the publicly available WIDER dataset. |
IJB [156] | 2015 | IJB-A/B/C is a dataset series for face detection and recognition. IJB-A contains 1,845 subjects, 11,754 images, 55,026 video frames, 7,011 videos and 10,044 non-facial images. |
DarkFace [157] | 2019 | The DarkFace dataset offers 6000 nighttime low-light photos from real-world locations, all labeled with bounding boxes of human faces. Additionally, this dataset has 9000 unlabeled low-light images taken in the same environment. |
UFDD [158] | 2018 | UFDD, an unconstrained face detection dataset, consists of more than 6000 images and 11,000 faces, and it contains seven scenes: rain, snow, haze, blur, illumination, lens impediments and distractors. |
WildestFaces [159] | 2018 | The WildestFaces dataset includes 67,889 pictures. Along with annotations for face detection and recognition, it also includes tags for blur severity, scale and occlusion. |
Dataset | Year | Description |
TinyPerson [136] | 2020 | TinyPerson is a challenging benchmark for tiny object detection in a complex context and at a long distance. A total of 72,651 labeled very small objects are included in the dataset. |
WiderPerson [160] | 2020 | The WiderPerson dataset contains 32,203 images with a total of 393,703 instances. |
EuroCity [161] | 2018 | The EuroCity person dataset was collected in several European countries by in-vehicle cameras; it includes about 47,300 images with more than 238,200 annotated instances of people. |
CityPersons [162] | 2017 | The CityPersons dataset is a subset of Cityscapes; it offers 5,000 images from 27 cities, built on Cityscapes' fine-grained, pixel-level annotations. |
Caltech [163] | 2009 | Caltech is a challenging dataset that contains low-resolution, frequently occluded objects. There are 192,000 and 155,000 pedestrian instances in the training and testing sets, respectively. |
Dataset | Year | Description |
DIOR [58] | 2020 | DIOR is made up of 20 common object categories, 23,463 optical remote sensing images and 192,472 hand-annotated object instances with axis-aligned bounding boxes. |
VisDrone [164] | 2022 | VisDrone was collected by the AISKYEYE team at Tianjin University, China, using several UAVs; it includes pedestrians, automobiles, bicycles and other categories. |
UAVDT [165] | 2018 | UAVDT is a sizable UAV-based video dataset with 80,000 total frames that are intended for vehicle detection and tracking. |
DOTA [166] | 2018 | DOTA has three versions so far; DOTA-v1.0 includes 188,282 instances in 2,806 aerial images across 15 main categories. |
NWPU VHR-10 [167] | 2016 | The NWPU VHR-10 dataset contains a total of 800 very high-resolution optical remote sensing images, which were acquired from Google Earth and Vaihingen. |
UCAS-AOD [168] | 2015 | The UCAS-AOD dataset includes many small objects with intricate backgrounds, with a total of 2,420 images and 14,596 instances. |
Dataset | Year | Scenario | Description |
SOD [42] | 2021 | Generic | SOD is a subset of the SUN [171] and MS COCO datasets. Ten types of objects that appear extremely small in the images were manually chosen by the authors. |
TT100K [56] | 2016 | Traffic Sign | TT100K has 100,000 images and 30,000 traffic sign instances across 128 classes. |
DeepScores [169] | 2018 | Sheet Music | DeepScores includes high-quality images of sheet music, with around 100 million small objects. |
KITTI [170] | 2012 | Traffic Scene | KITTI has up to 15 vehicles and 30 pedestrians in each image captured in Karlsruhe, Germany. |
Model | Year | Backbone | AP | AP50 | AP75 | APs | APm | APl | FPS |
Faster R-CNN [4] | 2015 | R101-FPN | 36.5 | 58.3 | 39.3 | 18.4 | 40.6 | 50.6 | 6 |
Mask R-CNN [5] | 2017 | R101‑FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 | 8.6 |
YOLOv7-tiny [8] | 2022 | – | 38.7 | – | – | – | – | – | 286 |
YOLOv7-E6E [8] | 2022 | – | 56.8 | 74.4 | 62.1 | – | – | – | 36 |
FPN [35] | 2017 | R101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 6 |
SSD [36] | 2015 | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 | 28 |
CornerNet* [38] | 2018 | Hourglass | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9 | 4.1 |
FCOS [39] | 2019 | R101-FPN | 41.8 | 60.3 | 45.3 | 25.6 | 47.7 | 56.1 | 7 |
EfficientDet [71] | 2020 | EfficientDet | 33.8 | 52.2 | 35.8 | 12.0 | 38.3 | 51.2 | 98 |
RetinaNet [74] | 2017 | R101-FPN | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | 13.6 |
FSSD [75] | 2017 | VGG16 | 31.8 | 52.8 | 33.5 | 14.2 | 35.1 | 45.0 | 65 |
HRDNet* [78] | 2021 | R101+152 | 47.4 | 66.9 | 51.8 | 32.1 | 50.5 | 55.8 | 2.8 |
RHF-Net [79] | 2020 | ResNet-101 | 37.7 | 59.8 | 40.1 | 19.9 | 42.9 | 51.5 | 29.1 |
QueryDet [81] | 2021 | R50-FPN | 38.2 | 58.6 | 40.9 | 23.7 | 42.0 | 49.5 | 13.6 |
SNIP [95] | 2018 | DPN [174] | 45.7 | 67.3 | 51.1 | 29.3 | 48.8 | 57.1 | 5 |
SNIPER [96] | 2018 | ResNet101 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 | 5 |
FR-FDWT [99] | 2019 | ResNet-101 | 42.1 | 63.4 | 45.7 | 21.8 | 45.1 | 57.1 | 7 |
ION [101] | 2016 | VGG16 | 24.6 | 46.3 | 23.3 | 7.4 | 26.2 | 38.8 | 1.3 |
DSSD [102] | 2017 | ResNet101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 6.4 |
FRCNN-DST [113] | 2021 | R101-FPN | 40.1 | 59.3 | 43.2 | 25.6 | 43.9 | 50.9 | 9 |
Retina-DST [113] | 2021 | R101-FPN | 41.3 | 59.9 | 43.8 | 25.4 | 45.1 | 54.0 | 13.6 |
FCOS-DST [113] | 2021 | R101-FPN | 41.6 | 60.0 | 44.6 | 26.5 | 45.4 | 53.1 | 7 |
PPDet [117] | 2020 | R101-FPN | 39.6 | 58.0 | 43.4 | 23.9 | 44.1 | 51.0 | 7.5 |
CenterNet++ [118] | 2022 | ResNet-101 | 47.7 | 65.1 | 51.9 | 27.8 | 50.5 | 60.6 | 104 |
DCN* [125] | 2017 | AlignedIncR | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 | 7 |
RefineDet [172] | 2018 | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 | 24 |
D2Det [173] | 2020 | R101-FPN | 45.4 | 64.0 | 49.5 | 25.8 | 48.7 | 58.1 | 4 |
CoupleNet [175] | 2017 | ResNet101 | 33.1 | 53.5 | 35.4 | 11.6 | 36.3 | 50.1 | 8.2 |
Regionlets [176] | 2018 | ResNet-101 | 39.3 | 59.8 | – | 21.7 | 43.7 | 50.9 | – |
FitnessNMS [177] | 2018 | ResNet-101 | 41.8 | 60.9 | 44.9 | 21.5 | 45.0 | 57.5 | – |
PPYOLOE [178] | 2022 | CSPRepRes | 43.1 | 60.5 | 46.6 | 23.2 | 45.2 | 56.9 | 208 |
IENet [180] | 2021 | ResNet-101 | 51.2 | 69.3 | 56.1 | 34.5 | 53.8 | 63.6 | 3 |
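For reference, the APs/APm/APl columns above follow the standard COCO size split: an instance is small if its area is below 32², medium up to 96², and large otherwise. A small helper makes the thresholds concrete:

```python
def coco_size_bucket(width, height):
    """Classify a ground-truth box by the COCO area thresholds that
    define the APs / APm / APl columns."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```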
Method | Year | Backbone | AP (Easy) | AP (Medium) | AP (Hard) |
Faster R-CNN [4] | 2015 | ResNet50 | 84.0 | 72.4 | 34.7 |
RetinaNet [74] | 2017 | ResNet50 | 94.8 | 93.8 | 89.6 |
S3FD [124] | 2017 | VGG16 | 93.4 | 92.7 | 85.4 |
TFD with GAN [125] | 2018 | VGG16 | 93.2 | 92.2 | 85.8 |
Face-MagNet [126] | 2018 | ResNet101 | 92.5 | 91.4 | 83.1 |
TinaFace [128] | 2020 | ResNet50 | 96.3 | 95.7 | 92.1 |
Small Hard Face [132] | 2020 | VGG16 | 95.0 | 93.8 | 88.5 |
IENet [180] | 2021 | ResNet50 | 96.1 | 94.7 | 89.6 |
RetinaNet | 2019 | Mobilenet [181] | 87.9 | 80.7 | 40.3 |
PyramidBox [182] | 2018 | ResNet50 | 95.5 | 94.6 | 88.8 |
RetinaFace [183] | 2019 | ResNet50 | 88.6 | 87.0 | 80.1 |
Method | Year | $ M{R}_{50}^{tiny} $ | $ M{R}_{50}^{small} $ | $ M{R}_{25}^{tiny} $ | $ M{R}_{75}^{tiny} $ | $ {AP}_{50}^{tiny} $ | $ {AP}_{50}^{small} $ | $ {AP}_{25}^{tiny} $ | $ {AP}_{75}^{tiny} $ |
Faster R-CNN [4] | 2015 | 87.78 | 71.31 | 77.35 | 98.4 | 43.55 | 56.69 | 64.07 | 5.35 |
FPN [36] | 2017 | 87.57 | 72.56 | 76.59 | 98.39 | 47.35 | 63.18 | 68.43 | 5.83 |
FCOS [39] | 2019 | 96.12 | 84.14 | 89.56 | 99.56 | 17.9 | 35.75 | 40.49 | 1.45 |
RetinaNet [74] | 2017 | 92.66 | 82.84 | 81.95 | 99.13 | 33.53 | 48.26 | 61.51 | 2.28 |
Grid R-CNN [185] | 2018 | 87.96 | 73.16 | 78.27 | 98.21 | 47.14 | 62.48 | 68.89 | 6.38 |
DSFD [186] | 2019 | 93.47 | 78.72 | 78.02 | 99.48 | 31.15 | 51.64 | 59.58 | 1.99 |
FreeAnchor [187] | 2022 | 88.97 | 73.67 | 77.62 | 98.7 | 41.36 | 53.36 | 63.73 | 4.00 |
Li-RCNN [188] | 2019 | 89.22 | 74.86 | 82.44 | 98.78 | 44.68 | 62.65 | 64.77 | 6.26 |
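The MR columns above are miss rates: the fraction of ground-truth persons left unmatched at the given IoU threshold (0.25, 0.5 or 0.75). A minimal greedy-matching sketch; the real TinyPerson protocol also handles ignore regions and FPPI operating points, which are omitted here:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def miss_rate(gts, dets, thr=0.5):
    """Fraction of ground truths not matched by any detection at IoU >= thr,
    using one-to-one greedy matching."""
    matched = 0
    used = set()
    for g in gts:
        for i, d in enumerate(dets):
            if i not in used and iou(g, d) >= thr:
                matched += 1
                used.add(i)
                break
    return 1.0 - matched / len(gts)
```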
Method | Year | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
Faster R-CNN-O [4] | 2015 | 88.4 | 73.1 | 44.9 | 59.1 | 73.3 | 71.5 | 77.1 | 90.8 | 78.9 | 83.9 | 48.6 | 63.0 | 62.2 | 64.9 | 56.2 | 69.1 |
Mask R-CNN [5] | 2020 | 76.8 | 73.5 | 49.9 | 57.8 | 51.3 | 71.3 | 79.7 | 90.4 | 75.1 | 67.3 | 48.5 | 70.6 | 64.8 | 64.5 | 55.9 | 63.4 |
CenterNet-O [40] | 2019 | 81.0 | 64.0 | 22.6 | 56.6 | 38.6 | 64.0 | 64.9 | 90.8 | 78.0 | 72.5 | 44.0 | 41.1 | 55.5 | 55.0 | 57.4 | 59.1 |
RetinaNet-O [74] | 2017 | 88.6 | 77.6 | 42.1 | 58.1 | 74.5 | 71.6 | 79.1 | 90.8 | 82.1 | 74.3 | 54.7 | 60.6 | 62.5 | 69.5 | 60.4 | 68.2 |
S2A-Net [140] | 2019 | 89.1 | 82.8 | 48.3 | 71.1 | 78.1 | 78.3 | 87.2 | 90.8 | 84.9 | 85.6 | 60.3 | 62.6 | 65.2 | 69.1 | 57.9 | 74.1 |
SCRDet [141] | 2019 | 89.9 | 80.7 | 52.1 | 68.4 | 68.4 | 60.3 | 72.4 | 90.9 | 87.9 | 86.9 | 65.0 | 66.7 | 66.3 | 68.2 | 65.2 | 72.6 |
Oriented R-CNN [142] | 2019 | 88.9 | 83.5 | 55.3 | 76.9 | 74.3 | 82.1 | 87.5 | 90.9 | 85.6 | 85.3 | 65.5 | 66.8 | 74.4 | 70.2 | 57.3 | 76.3 |
MRDet [143] | 2019 | 89.5 | 84.0 | 55.4 | 66.7 | 76.3 | 82.1 | 87.9 | 90.8 | 86.9 | 85.0 | 52.3 | 66.0 | 76.2 | 76.8 | 67.5 | 76.2 |
BBAVectors [145] | 2021 | 88.4 | 80.0 | 50.7 | 62.2 | 78.4 | 79.0 | 87.9 | 90.9 | 83.6 | 84.4 | 54.1 | 60.2 | 65.2 | 64.3 | 55.7 | 72.3 |
ReDet [147] | 2021 | 88.8 | 82.6 | 54.0 | 74.0 | 78.1 | 84.1 | 88.0 | 90.9 | 87.8 | 85.8 | 61.8 | 60.4 | 76.0 | 68.1 | 63.6 | 76.3 |
ROI-Trans [148] | 2019 | 88.6 | 78.5 | 43.4 | 75.9 | 68.8 | 73.7 | 83.6 | 90.7 | 77.3 | 81.5 | 58.4 | 53.5 | 62.8 | 58.9 | 47.7 | 69.6 |
RepPoints-O [151] | 2021 | 87.0 | 83.2 | 54.1 | 71.2 | 80.2 | 78.4 | 87.3 | 90.9 | 86.0 | 86.3 | 59.9 | 70.5 | 73.5 | 72.3 | 59.0 | 76.0 |
CAD-Net [189] | 2019 | 87.8 | 82.4 | 49.4 | 73.5 | 71.1 | 63.5 | 76.6 | 90.9 | 79.2 | 73.3 | 48.4 | 60.9 | 62.0 | 67.0 | 62.2 | 69.9 |
Title | Publication | Strengths | Limitations |
Object detection with deep learning: A review [57] | TNNLS 2019 | It reviews the deep learning-based object detection models and the difficulties of SOD. | These reviews offer a thorough summary of object detection. However, they concentrate on regular-sized object detection rather than small objects. |
Object detection in optical remote sensing images: A survey and a new benchmark [58] | ISPRS 2020 | It constructs DIOR, a large dataset of remote sensing. | |
Imbalance problems in object detection: A review [59] | arXiv 2020 | It reviews the imbalance problem of object detection. | |
Continual object detection: A review of definitions, strategies, and challenges [60] | arXiv 2022 | This survey investigates continual object detection. | |
New generation deep learning for video object detection: A survey [61] | TNNLS 2022 | It systematizes the latest video object detection models and analyzes the performance of these models on two datasets. | |
A survey of deep learning-based object detection [62] | IEEE Access 2022 | It reviews detection methods, general datasets and typical applications. | |
A survey of the four pillars for small object detection: Multi-scale representation, contextual information, super-resolution, and region proposal [64] | TSMC 2020 | It discusses the four pillars of SOD and reports on the performance of SOD on three datasets. | These studies do not contain a complete assessment of the most recent SOD approaches. |
Recent advances in small object detection based on deep learning: A review [65] | IVC 2020 | It reviews the SOD from five perspectives and analyzes the evaluation results for two general datasets. | |
A survey and performance evaluation of deep learning methods for small object detection [66] | ESWA 2021 | The solutions are summarized for the four challenges of SOD and some experiment analyses are provided. | It only analyzes the performance of three classical object detection algorithms (Faster R-CNN, SSD, YOLO). |
Deep learning-based detection from the perspective of small or tiny objects: A survey [67] | IVC 2022 | It discusses small- or tiny-object datasets, detection techniques and the performance of these techniques. | These surveys systematically reviewed the development of SOD. Nevertheless, they all lack a comprehensive review of techniques deliberately designed for critical SOD tasks. |
A guide to image and video based small object detection using deep learning: Case study of maritime surveillance [68] | arXiv 2022 | It reviews SOD methods and investigates the performance of SOD in maritime environments. | |
Towards large-scale small object detection: Survey and benchmarks [69] | arXiv 2022 | It presents a detailed study of SOD and yields two large-scale benchmarks for a driving scenario and aerial scene. | |
Deep learning based small object detection: A survey | Ours | We comprehensively discuss the definition of small objects, the challenges encountered in detecting small objects, the strengths and weaknesses of generic SOD algorithms and three crucial SOD tasks. We also analyze the performance of SOD on three datasets and summarize meaningful conclusions. |
Method | Publication | Techniques | Strengths | Weaknesses |
SSD [36] | ECCV16 | Pyramidal feature hierarchy without fusing features. | SSD can detect objects of various sizes. | The low-level prediction feature map has no strong semantics. |
FPN [35] | CVPR17 | Feature pyramid network (including feature fusion and multi-scaled fusion modules, etc.). | FPN dramatically improves the detection accuracy of small objects. | The feature representation capability will be diminished by the semantic gap between feature layers of various scales. |
RetinaNet [74] | ICCV17 | Focal loss for dense detection. | RetinaNet alleviates the foreground-background class imbalance problem. | |
FSSD [75] | arXiv17 | Lightweight feature fusion module. | ||
MDSSD [76] | arXiv18 | Multi-scale deconvolution fusion modules. | MDSSD incorporates contextual information that is more conducive to SOD. | Lower detection speed than SSD. |
[77] | CVPR21 | FPN with a learnable fusion factor. | The fusion factor further improves FPN performance for small objects. | |
HRDNet [78] | arXiv20 | FPN with image pyramid. | HRDNet acquires more details for small objects with high resolution. | Large numbers of parameters. |
IPG-Net [79] | CVPR20 | Image pyramid guidance. | IPG-Net alleviates the vanishing of small-object features. | This method is inefficient. |
RHF-Net [80] | CVPR20 | Recursive hybrid fusion pyramid network. | Low computational cost and high accuracy. | |
QueryDet [81] | CVPR22 | Query mechanism. | Accelerating inference with sparse query. | |
EFPN [82] | TMM22 | Super-resolution (including a super-resolution layer that enhances small-object representations to resemble those of large objects). | EFPN adds a high-resolution layer to FPN to increase the accuracy of SOD. | Super-resolution feature extraction leads to more computational costs. |
[83] | CVPR17 | GAN-based super-resolution of small-object features. | The GAN-based approaches effectively enhance the level of detail of information on small objects. | |
MTGAN [84] | ECCV18 | |||
TPS [85] | ICCV19 | |||
MRAE [86] | CVPR22 | FPN with attention weight. | It provides a practical solution for multi-resolution feature extraction without using a GAN, and it is time-efficient. |
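The top-down pathway these FPN variants build on is a two-step merge: upsample the coarser map, then add the lateral (finer) map. An index-level sketch with nearest-neighbor upsampling and an identity lateral projection; real FPN uses learned 1×1 and 3×3 convolutions on feature tensors, not plain lists:

```python
def fpn_merge(top, lateral):
    """One FPN top-down step: 2x nearest-neighbor upsample of the
    coarser map `top`, then element-wise addition of the finer
    `lateral` map (both 2-D lists; lateral is twice the size of top)."""
    up = [[top[i // 2][j // 2] for j in range(2 * len(top[0]))]
          for i in range(2 * len(top))]
    return [[up[i][j] + lateral[i][j] for j in range(len(lateral[0]))]
            for i in range(len(lateral))]
```

Because the merged map keeps the lateral level's resolution while inheriting the coarser level's semantics, small objects gain strong features at high resolution, which is the accuracy gain the table attributes to FPN.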
Method | Publication | Techniques | Strengths | Weaknesses |
SNIP [95], SNIPER [96] | CVPR18 | Scale normalization training strategy for image pyramids. | SNIP and SNIPER can effectively improve the detection performance of small objects. | It requires an input image pyramid that brings a high computational cost. |
SAN [97] | CVPR18 | Scale-aware training. | SAN makes the network more robust against scale invariance. | |
Trident [98] | ICCV19 | Multi-branch architecture and scale-aware training. | Multi-branch technique makes the receptive field size align with the object size. | Each branch may over-fit because too few effective samples fall in its scale range. |
POD [99] | ICCV19 | Global scale learning. | This method makes the network more sensitive to scale invariance. |
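SNIP's scale normalization reduces to a filter: at each image-pyramid scale, only ground truths whose resized size falls inside that level's valid range contribute gradients. A sketch of that selection step; the range values here are illustrative, not the paper's:

```python
def snip_valid_boxes(box_sizes, scale, valid_range=(40, 160)):
    """Keep only boxes whose size after resizing by `scale` falls in the
    valid range for this pyramid level; the rest are ignored during
    training, as in SNIP-style scale normalization."""
    lo, hi = valid_range
    return [s for s in box_sizes if lo <= s * scale <= hi]
```

Running this per pyramid level means every object is trained at roughly one appropriate resolution, which is why the strategy helps small objects but requires the costly image pyramid noted in the table.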
Method | Publication | Techniques | Strengths | Weaknesses |
ION [101] | CVPR16 | Integrate contextual information. | ION exploits context and multi-scale representations to improve SOD. | Underutilization of early feature layers. |
DSSD [102], CSSD [103] | arXiv17, WACV18 | Fusing contextual information in different ways to improve SOD performance. | | Slower detection speed than SSD. |
SMN [104] | ICCV17 | Spatial memory for contextual reasoning. | SMN models the instance-level context to improve the performance of SOD. | The gradient will vanish as the reasoning signal and perceptual signal cancel each other out. |
IRR [105] | arXiv20 | Contextual reasoning integrates intrinsic relations. | IRR updates the initial regional features to boost SOD. | Small objects are associated with difficulty in extracting semantic features. |
FA-SSD [106] | ICAIIC21 | Context with attention. | FA-SSD is more accurate than SSD. | Its accuracy is lower than that of DSSD. |
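The common denominator of these context models is fusing a region's own features with features from an enlarged window around it. A sketch of the generic pattern, not any specific paper's implementation; the enlargement factor is illustrative:

```python
def context_box(box, image_w, image_h, factor=2.0):
    """Enlarge a box (x1, y1, x2, y2) about its center by `factor`
    to form the context window, clipped to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(image_w), cx + w / 2), min(float(image_h), cy + h / 2))

def fuse_with_context(roi_feat, ctx_feat):
    """Channel-wise concatenation of region and context feature vectors,
    the simplest fusion used by context-augmented detectors."""
    return roi_feat + ctx_feat  # list concatenation stands in for channel concat
```

The extra channels give the classifier surrounding evidence (e.g. road below a distant car), which is precisely what small objects lack in their own few pixels.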
Method | Publication | Techniques | Strengths | Weaknesses |
Kisantal et al. [110] | arXiv19 | Oversampling and random copy-pasting. | This approach achieves better object detection accuracy for small objects. | Random copying and pasting may cause background mismatch. |
RRNet [111] | ICCV19 | Adaptive resampling augmentation strategy. | Free-anchor and adaptive resampling result in excellent performance for very small objects. | |
Ünel et al. [112] | CVPR19 | Tiling-based augmentation. | The method provides a good trade-off between accuracy and time cost. | |
DST [113] | arXiv20 | Uses the feedback information to guide data preparation. | Feedback-driven and dynamic data preparation paradigms mitigate the scale-invariant issue. | |
Zoph et al. [114] | ECCV20 | Automatic data augmentation | This approach has no additional inference cost and minimal training cost. | The strategy is intricate. |
Chen et al. [116] | CVPR21 | It can be transferred to other datasets and tasks and is scale-sensitive. | The high time cost of auto-augmentation approaches for searching. |
Method | Publication | Techniques | Strengths | Weaknesses |
PPDet [117] | BMVC20 | Anchor-free with a new label strategy. | It reduces the contributions of non-discriminatory features during training. |
|
CenterNet++ [118] | CVPR21 | An anchor-free detector that uses triplet key points to represent objects. | This model with multi-resolution performs better. | |
NWD [119] | arXiv21 | A new metric to replace IoU. | These two metrics are more effective than the IoU metric for small object detection. | |
RFLA [120] | ECCV22 | |||
C3Det [121] | CVPR22 | Annotation framework for tiny objects. | It alleviates the expense of tiny-object annotation. | |
SAHI [122] | arXiv22 | Slicing-aided inference. | This scheme is plug-and-play, does not require pre-training and improves the accuracy of detecting small objects. | Larger feature maps require more memory and computing cost. |
Method | Publication | Techniques | Strengths | Weaknesses |
Hu and Ramanan [123] | CVPR17 | Super-resolution by GAN. | The joint super-resolution and refinement model is effective. | No fusion of contextual information. |
S3FD [124] | ICCV17 | Scale-invariant strategy. | The three-trick, scale-equitable framework, max-out, and scale compensation anchor-matching achieve superior performance. | |
MagNet [126] | WACV18 | Feature fusion approach to integrating contextual information. | ConvTranspose is more helpful than skip connections or context pooling. | The improvement is not obvious. |
Zhu et al. [127] | CVPR18 | EMO metric to get a high IoU. | The EMO score inspired several effective strategies for a new anchor design to obtain a higher facial IoU score. | |
TinaFace [128] | arXiv21 | Geometric transformations and multi-scale representation. | Simple improvements of RetinaNet achieve better performance. | |
Zhang et al. [132] | WACV20 | Hard example mining and super-resolution. | It handles the imbalance between images. |
Method | Publication | Techniques | Strengths | Weaknesses |
Song et al. [133] | ECCV18 | A topological line detection. | TLL can automatically adapt to small-scale pedestrians. | No mitigation of information loss for small pedestrians. |
SaYwF [134] | arXiv19 | A three-phase detection model. | Achieves a trade-off between detection accuracy and detection speed. | |
CSP [135] | CVPR19 | Pedestrian detection is converted to high-level semantic feature prediction. | No additional post-processing is required for CSP. | Objects with a large variance in aspect ratio need to be examined. |
FSAF [37] | CVPR19 | Feature-selective anchor-free module. | Dynamically assigning each instance to the most suitable feature level is more robust. | Separate anchor-free branches do not have many advantages over anchor-based branches. |
Yu et al. [136] | WACV20 | Scale match of the pre-trained dataset to the task-specified dataset. | Scale match can better utilize the existing annotated data. | It has poor performance on TinyPerson. |
Method | Publication | Techniques | Strengths | Weaknesses |
Zhang et al. [144] | ICCV19 | Model fusion, cascade network, deformable convolution and data augmentation. | The joint optimization of four strategies makes the model perform well on VisDrone. | The efficiency and detection speed of the network is poor. |
Yi et al. [145] | WACV21 | Oriented anchor-free object detector. | Extended BBAVector technique on CenterNet is simple and effective. | |
ReDet [147] | arXiv21 | Rotation-invariant feature representation. | Smaller models and better results for small- and medium-sized objects. | |
DarkNet-RI [149] | TGRS21 | The multi-scale and rotation-invariant feature representation is robust against scale-variance. | Need to enhance the overlapping and occluded object detection. | |
Li et al. [151] | CVPR22 | Adaptive points learning approach. | This model can classify and localize objects with arbitrary orientation. | It requires large computing cost. |
DotD [152] | CVPRW21 | A new metric DotD. | It's valid for defining positive and negative anchors in training. |
Dataset | Year | Description |
WIDER FACE [47] | 2016 | WIDER FACE is a large-scale dataset of face images. Images are selected from the publicly available WIDER dataset. |
IJB [156] | 2015 | IJB-A/B/C is a dataset for face detection and recognition. IJB-A contains 1845 objects, 11,754 images, 55,026 video frames, 7011 videos and 10,044 non-facial images. |
DarkFace [157] | 2019 | The DarkFace dataset offers 6000 nighttime low-light photos from real-world locations, all labeled with bounding boxes of human faces. Additionally, this dataset has 9000 unlabeled low-light images taken in the same environment. |
UFDD [158] | 2018 | UFDD, an unconstrained face detection dataset, consists of more than 6000 images and 11,000 faces, and it contains seven scenes: rain, snow, haze, blur, illumination, lens impediments and distractors. |
WildestFaces [159] | 2018 | The WildestFaces dataset includes 67,889 pictures. Along with annotations for face detection and recognition, it also includes tags for blur severity, scale and occlusion. |
Dataset | Year | Description |
TinyPerson [136] | 2020 | TinyPerson is a challenging benchmark for tiny object detection in a complex context and at a long distance. A total of 72,651 labeled very small objects are included in the dataset. |
WiderPerson [160] | 2020 | The WiderPerson dataset, which contains 32203 images with a total of 393703 instances. |
EuroCity [161] | 2018 | The EuroCity person dataset was collected in several European countries by in-vehicle cameras; it includes about 47,300 images with more than 238,200 annotated instances of people. |
Citypersons [162] | 2017 | The Citypersons dataset is a subset of a cityscape; it offers 5,000 images from 27 cities with 30 fine-grained, pixel-level annotations. |
Caltech [163] | 2009 | Caltech is a challenging dataset that contains low-resolution, frequently obstructed objects. There are 192,000 and 155,000 pedestrian instances in the training and testing sets, respectively. |
Dataset | Year | Description |
DIOR [58] | 2020 | DIOR is made up of 20 common object categories, 23,463 optimum remote sensing images and 192,472 hand-annotated object instances with axis-aligned bounding boxes. |
VisDrone [164] | 2022 | VisDrone was collected by the AISKYEYE team at Tianjin University in China while utilizing several UAVs; it includes pedestrians, automobiles, bicycles and other categories. |
UAVDT [165] | 2018 | UAVDT is a sizable UAV-based video dataset with 80,000 total frames that are intended for vehicle detection and tracking. |
DOTA [166] | 2018 | DOTA has three versions so far; DOTA-v1.0 includes 188,282 instances of 2806 aerial images in 15 main categories. |
NWPU VHR-10 [167] | 2016 | The NWPU VHR-10 dataset contains a total of 800 very high-resolution optical remote sensing images, which were acquired from Google Earth and Vaihingen. |
UCAS-AOD [168] | 2015 | The UCAS-AOD datasets include many small objects with intricate backgrounds with a total of 2420 images and 14,596 instances. |
Dataset | Year | Scenario | Description |
SOD [42] | 2021 | Generic | SOD is a subset of the SUN [171] and MS COCO datasets. Ten types of objects that appear extremely small in the images were manually chosen by the authors. |
TT100K [56] | 2016 | Traffic Sign | TT100K has 100,000 images and 30,000 traffic sign instances across 128 classes. |
DeepScores [169] | 2018 | Stradivarius | DeepScores includes high-quality images of sheet music, with around 100 million small objects. |
KITTI [170] | 2012 | Traffic Scene | KITTI has up to 15 vehicles and 30 pedestrians in each image captured in Karlsruhe, Germany. |
Model | Year | Backbone | AP | AP50 | AP75 | APs | APm | APl | FPS |
Faster R-CNN [4] | 2015 | R101-FPN | 36.5 | 58.3 | 39.3 | 18.4 | 40.6 | 50.6 | 6 |
Mask R-CNN [5] | 2017 | R101‑FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 | 8.6 |
YOLOv7-tiny [8] | 2022 | 38.7 | 286 | ||||||
YOLOv7-E6E [8] | 2022 | 56.8 | 74.4 | 62.1 | 36 | ||||
FPN [35] | 2017 | R101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 6 |
SSD [36] | 2015 | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 | 28 |
CornerNet* [38] | 2018 | Hourglass | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9 | 4.1 |
FCOS [39] | 2019 | R101-FPN | 41.8 | 60.3 | 45.3 | 25.6 | 47.7 | 56.1 | 7 |
Efficientdet [71] | 2020 | Efficientdet | 33.8 | 52.2 | 35.8 | 12.0 | 38.3 | 51.2 | 98 |
RetinaNet [74] | 2017 | R101-FPN | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 | 13.6 |
FSSD [75] | 2017 | VGG16 | 31.8 | 52.8 | 33.5 | 14.2 | 35.1 | 45.0 | 65 |
HRDNet* [78] | 2021 | R101+152 | 47.4 | 66.9 | 51.8 | 32.1 | 50.5 | 55.8 | 2.8 |
RHF-Net [79] | 2020 | ResNet-101 | 37.7 | 59.8 | 40.1 | 19.9 | 42.9 | 51.5 | 29.1 |
QueryDet [81] | 2021 | R50-FPN | 38.2 | 58.6 | 40.9 | 23.7 | 42.0 | 49.5 | 13.6 |
SNIP [95] | 2018 | DPN [174] | 45.7 | 67.3 | 51.1 | 29.3 | 48.8 | 57.1 | 5 |
SNIPER [96] | 2018 | ResNet101 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 | 5 |
FR-FDWT [99] | 2019 | ResNet-101 | 42.1 | 63.4 | 45.7 | 21.8 | 45.1 | 57.1 | 7 |
ION [101] | 2016 | VGG16 | 24.6 | 46.3 | 23.3 | 7.4 | 26.2 | 38.8 | 1.3 |
DSSD [102] | 2017 | ResNet101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 6.4 |
FRCNN-DST [113] | 2021 | R101-FPN | 40.1 | 59.3 | 43.2 | 25.6 | 43.9 | 50.9 | 9 |
Retina-DST [113] | 2021 | R101-FPN | 41.3 | 59.9 | 43.8 | 25.4 | 45.1 | 54.0 | 13.6 |
FCOS-DST [113] | 2021 | R101-FPN | 41.6 | 60.0 | 44.6 | 26.5 | 45.4 | 53.1 | 7 |
PPDet [117] | 2020 | R101-FPN | 39.6 | 58.0 | 43.4 | 23.9 | 44.1 | 51.0 | 7.5 |
CenterNet++ [118] | 2022 | ResNet-101 | 47.7 | 65.1 | 51.9 | 27.8 | 50.5 | 60.6 | 104 |
DCN* [125] | 2017 | AlignedIncR | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 | 7 |
RefineDet [172] | 2018 | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 | 24 |
D2Det [173] | 2020 | R101-FPN | 45.4 | 64.0 | 49.5 | 25.8 | 48.7 | 58.1 | 4 |
CoupleNet [175] | 2017 | ResNet101 | 33.1 | 53.5 | 35.4 | 11.6 | 36.3 | 50.1 | 8.2 |
Regionlets [176] | 2018 | ResNet-101 | 39.3 | 59.8 | – | 21.7 | 43.7 | 50.9 | – |
FitnessNMS [177] | 2018 | ResNet-101 | 41.8 | 60.9 | 44.9 | 21.5 | 45.0 | 57.5 | – |
PPYOLOE [178] | 2022 | CSPRepRes | 43.1 | 60.5 | 46.6 | 23.2 | 45.2 | 56.9 | 208 |
IENet [180] | 2021 | ResNet-101 | 51.2 | 69.3 | 56.1 | 34.5 | 53.8 | 63.6 | 3 |
Method | Year | Backbone | AP | ||
Easy | Medium | Hard | |||
Faster R-CNN [4] | 2015 | ResNet50 | 84.0 | 72.4 | 34.7 |
RetinaNet [74] | 2017 | ResNet50 | 94.8 | 93.8 | 89.6 |
S3FD [124] | 2017 | VGG16 | 93.4 | 92.7 | 85.4 |
TFD with GAN [125] | 2018 | VGG16 | 93.2 | 92.2 | 85.8 |
Face-MagNet [126] | 2018 | ResNet101 | 92.5 | 91.4 | 83.1 |
TinaFace [128] | 2020 | ResNet50 | 96.3 | 95.7 | 92.1 |
Small Hard Face [132] | 2020 | VGG16 | 95.0 | 93.8 | 88.5 |
IENet [180] | 2021 | ResNet50 | 96.1 | 94.7 | 89.6 |
RetinaNet | 2019 | Mobilenet [181] | 87.9 | 80.7 | 40.3 |
PyramidBox [182] | 2018 | ResNet50 | 95.5 | 94.6 | 88.8 |
RetinaFace [183] | 2019 | ResNet50 | 88.6 | 87.0 | 80.1 |
Method | Year | $MR_{50}^{tiny}$ | $MR_{50}^{small}$ | $MR_{25}^{tiny}$ | $MR_{75}^{tiny}$ | $AP_{50}^{tiny}$ | $AP_{50}^{small}$ | $AP_{25}^{tiny}$ | $AP_{75}^{tiny}$ |
Faster R-CNN [4] | 2015 | 87.78 | 71.31 | 77.35 | 98.40 | 43.55 | 56.69 | 64.07 | 5.35 |
FPN [36] | 2017 | 87.57 | 72.56 | 76.59 | 98.39 | 47.35 | 63.18 | 68.43 | 5.83 |
FCOS [39] | 2019 | 96.12 | 84.14 | 89.56 | 99.56 | 17.90 | 35.75 | 40.49 | 1.45 |
RetinaNet [74] | 2017 | 92.66 | 82.84 | 81.95 | 99.13 | 33.53 | 48.26 | 61.51 | 2.28 |
Grid R-CNN [185] | 2018 | 87.96 | 73.16 | 78.27 | 98.21 | 47.14 | 62.48 | 68.89 | 6.38 |
DSFD [186] | 2019 | 93.47 | 78.72 | 78.02 | 99.48 | 31.15 | 51.64 | 59.58 | 1.99 |
FreeAnchor [187] | 2022 | 88.97 | 73.67 | 77.62 | 98.70 | 41.36 | 53.36 | 63.73 | 4.00 |
Li-RCNN [188] | 2019 | 89.22 | 74.86 | 82.44 | 98.78 | 44.68 | 62.65 | 64.77 | 6.26 |
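The MR columns above report miss rate, where lower is better, alongside AP. TinyPerson follows the pedestrian-detection convention of the Caltech protocol: the log-average miss rate, i.e. the geometric mean of the miss rate sampled at nine FPPI (false positives per image) points log-spaced in $[10^{-2}, 1]$. A minimal sketch under that standard definition, assuming a precomputed miss-rate-vs-FPPI curve:

```python
import math

def log_average_miss_rate(fppi, miss_rates):
    """Caltech-style log-average miss rate.

    fppi:       false positives per image, sorted ascending
    miss_rates: miss rate aligned with each fppi value
    """
    refs = [10 ** (-2 + 0.25 * i) for i in range(9)]  # 9 log-spaced points
    samples = []
    for r in refs:
        # Miss rate at the largest observed FPPI not exceeding the
        # reference point; fall back to the first value if the curve
        # starts above it.
        mr = miss_rates[0]
        for f, m in zip(fppi, miss_rates):
            if f <= r:
                mr = m
            else:
                break
        samples.append(max(mr, 1e-10))  # guard log(0)
    return math.exp(sum(math.log(m) for m in samples) / len(samples))
```

A flat curve returns its constant value unchanged, which is a quick sanity check; the geometric mean emphasizes the low-FPPI regime, where detectors of tiny objects tend to differ most.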
Method | Year | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
Faster R-CNN-O [4] | 2015 | 88.4 | 73.1 | 44.9 | 59.1 | 73.3 | 71.5 | 77.1 | 90.8 | 78.9 | 83.9 | 48.6 | 63.0 | 62.2 | 64.9 | 56.2 | 69.1 |
Mask R-CNN [5] | 2020 | 76.8 | 73.5 | 49.9 | 57.8 | 51.3 | 71.3 | 79.7 | 90.4 | 75.1 | 67.3 | 48.5 | 70.6 | 64.8 | 64.5 | 55.9 | 63.4 |
CenterNet-O [40] | 2019 | 81.0 | 64.0 | 22.6 | 56.6 | 38.6 | 64.0 | 64.9 | 90.8 | 78.0 | 72.5 | 44.0 | 41.1 | 55.5 | 55.0 | 57.4 | 59.1 |
RetinaNet-O [74] | 2017 | 88.6 | 77.6 | 42.1 | 58.1 | 74.5 | 71.6 | 79.1 | 90.8 | 82.1 | 74.3 | 54.7 | 60.6 | 62.5 | 69.5 | 60.4 | 68.2 |
S2A-Net [140] | 2019 | 89.1 | 82.8 | 48.3 | 71.1 | 78.1 | 78.3 | 87.2 | 90.8 | 84.9 | 85.6 | 60.3 | 62.6 | 65.2 | 69.1 | 57.9 | 74.1 |
SCRDet [141] | 2019 | 89.9 | 80.7 | 52.1 | 68.4 | 68.4 | 60.3 | 72.4 | 90.9 | 87.9 | 86.9 | 65.0 | 66.7 | 66.3 | 68.2 | 65.2 | 72.6 |
Oriented R-CNN [142] | 2019 | 88.9 | 83.5 | 55.3 | 76.9 | 74.3 | 82.1 | 87.5 | 90.9 | 85.6 | 85.3 | 65.5 | 66.8 | 74.4 | 70.2 | 57.3 | 76.3 |
MRDet [143] | 2019 | 89.5 | 84.0 | 55.4 | 66.7 | 76.3 | 82.1 | 87.9 | 90.8 | 86.9 | 85.0 | 52.3 | 66.0 | 76.2 | 76.8 | 67.5 | 76.2 |
BBAVectors [145] | 2021 | 88.4 | 80.0 | 50.7 | 62.2 | 78.4 | 79.0 | 87.9 | 90.9 | 83.6 | 84.4 | 54.1 | 60.2 | 65.2 | 64.3 | 55.7 | 72.3 |
ReDet [147] | 2021 | 88.8 | 82.6 | 54.0 | 74.0 | 78.1 | 84.1 | 88.0 | 90.9 | 87.8 | 85.8 | 61.8 | 60.4 | 76.0 | 68.1 | 63.6 | 76.3 |
ROI-Trans [148] | 2019 | 88.6 | 78.5 | 43.4 | 75.9 | 68.8 | 73.7 | 83.6 | 90.7 | 77.3 | 81.5 | 58.4 | 53.5 | 62.8 | 58.9 | 47.7 | 69.6 |
RepPoints-O [151] | 2021 | 87.0 | 83.2 | 54.1 | 71.2 | 80.2 | 78.4 | 87.3 | 90.9 | 86.0 | 86.3 | 59.9 | 70.5 | 73.5 | 72.3 | 59.0 | 76.0 |
CAD-Net [189] | 2019 | 87.8 | 82.4 | 49.4 | 73.5 | 71.1 | 63.5 | 76.6 | 90.9 | 79.2 | 73.3 | 48.4 | 60.9 | 62.0 | 67.0 | 62.2 | 69.9 |
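The DOTA results above come from oriented detectors (the "-O" suffix marks oriented variants of horizontal-box methods), so the IoU underlying mAP is computed between rotated rectangles rather than axis-aligned boxes. A minimal sketch via Sutherland-Hodgman polygon clipping, assuming a (cx, cy, w, h, angle-in-degrees) box parameterization; production evaluators use optimized geometry kernels, but the math is the same:

```python
import math

def rect_corners(cx, cy, w, h, angle_deg):
    """Corners of a rotated rectangle, counter-clockwise."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c)
            for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]]

def _clip(subject, a, b):
    """Keep the part of `subject` on the left of the directed edge a->b."""
    def inside(p):
        return (b[0]-a[0]) * (p[1]-a[1]) - (b[1]-a[1]) * (p[0]-a[0]) >= 0
    def intersect(p, q):
        x1, y1 = p; x2, y2 = q; x3, y3 = a; x4, y4 = b
        denom = (x1-x2) * (y3-y4) - (y1-y2) * (x3-x4)
        t = ((x1-x3) * (y3-y4) - (y1-y3) * (x3-x4)) / denom
        return (x1 + t * (x2-x1), y1 + t * (y2-y1))
    out = []
    for i in range(len(subject)):
        p, q = subject[i], subject[(i + 1) % len(subject)]
        if inside(q):
            if not inside(p):
                out.append(intersect(p, q))
            out.append(q)
        elif inside(p):
            out.append(intersect(p, q))
    return out

def _area(pts):
    """Shoelace area of a polygon."""
    return abs(sum(pts[i][0] * pts[(i+1) % len(pts)][1]
                   - pts[(i+1) % len(pts)][0] * pts[i][1]
                   for i in range(len(pts)))) / 2

def rotated_iou(r1, r2):
    """IoU of two rotated rectangles given as (cx, cy, w, h, angle_deg)."""
    p1, p2 = rect_corners(*r1), rect_corners(*r2)
    inter = p1
    for i in range(4):  # clip p1 by each edge of p2
        inter = _clip(inter, p2[i], p2[(i + 1) % 4])
        if not inter:
            return 0.0
    ia = _area(inter)
    return ia / (_area(p1) + _area(p2) - ia)
```

Two identical boxes give IoU 1.0, and two unit-area-4 squares offset by half a side give 1/3, matching the axis-aligned case; the angle term is what separates oriented mAP from the horizontal-box mAP in the earlier tables.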