1. Introduction
With the advancement of science and technology, deep learning has gradually become the mainstream of artificial intelligence and has performed well in many fields. Among these developments, object detection, which identifies both the position and the category of a target, has grown increasingly popular in recent years. Traditional image classification and image segmentation assign a category to an image but cannot locate where that category appears within it, which makes object detection a more complex and difficult task. Existing object detection methods are divided into two-stage detection algorithms, which emphasize detection accuracy, and one-stage detection algorithms, which emphasize detection speed.
Among deep-learning-based object detection methods, the two-stage detection algorithm was proposed first. The R-CNN model proposed by Girshick et al. [1] in 2014 is the pioneering work of the two-stage detection algorithm. Selective Search [2] screens multiple regions of interest, which are input to the deep learning model AlexNet [3] to extract features. The extracted features are then classified with a Support Vector Machine (SVM), and finally a bounding box regressor predicts the location of each region of interest.
Following the emergence of R-CNN, object detection advanced rapidly, and many scholars proposed modified algorithms based on R-CNN. Girshick et al. [4] proposed the Fast R-CNN model by combining R-CNN with SPP [5]; it inherits the advantages of spatial pyramid pooling, removes the restriction to fixed-size input images and increases detection accuracy to 70%. Ren et al. [6] proposed the Faster R-CNN model, which introduces the Region Proposal Network (RPN). Through end-to-end training, Faster R-CNN shares convolutional features between the RPN and the classification network, classifies the proposed regions of interest simultaneously and greatly reduces detection time. Li et al. [7] proposed the Feature Pyramid Network (FPN) to address the common limitation that many object detection algorithms predict only from the top layer of the network. The FPN has a top-down architecture, which improves the accuracy of object detection models, and it has become a basic building block for many subsequent extended models.
There are many related studies using two-stage object detection methods. Ghosh [8] proposed a new gait recognition method in 2022, using a modified Faster R-CNN to detect whether the pedestrians in a video are carrying objects. The proposed model used Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM) to identify gait patterns and was tested on four public gait datasets (OU-LP-Bag, OUTD-B, OULP-Age and CASIA-B). The results show that combining Faster R-CNN with BLSTM performs best, with an accuracy of up to 97.42%. In 2022, Chen et al. [9] proposed the Faster GG R-CNN model by combining a Genetic Algorithm (GA) with Faster R-CNN to detect textile defects in complex backgrounds. Its performance was compared with three object detection models, Faster R-CNN, MsDet and YOLOV3, and Faster GG R-CNN achieved the highest mAP of 94.57%. To observe the wear state of mechanical equipment, Miao et al. [10] used industrial cameras to capture wear images of different types (Normal, Adhesion, Abrasive, Corrosion), and they modified the Faster R-CNN model by replacing the original ResNet backbone with the VGG16 network and adding FPN to improve detection ability. Results were compared with the YOLOV3 and SSD object detection models. The proposed Faster R-CNN model shows the best results, with a Precision of 99.25%, Recall of 99.00% and F-measure of 99.00%.
Cui et al. [11] created an image dataset of highway ground penetrating radar (GPR), an underground survey device that measures road layer thickness. Their study uses Faster R-CNN with a ResNet-101 backbone to detect objects in GPR images and protect the safety of highway traffic. The Average Precision (AP), Precision and Recall were 88.13, 97.70 and 89.13%, respectively. The results show that the proposed model detects the underground layers of expressways intelligently and identifies GPR images automatically. In addition, a method based on Faster R-CNN was proposed to detect lung nodules using the public LIDC-IDRI database [12]. The CT images annotated by doctors were augmented and labeled, and ZF [13] and VGG16 were selected as the backbone models. Results were compared with Faster R-CNN, R-CNN and Fast R-CNN. The model with VGG16 demonstrated the best AP of 91.20%, which is 8.80, 22.80 and 15.80% higher than the APs of Faster R-CNN, R-CNN and Fast R-CNN, respectively. Yang et al. [14] proposed the MT-Faster R-CNN model in 2020, which extends Faster R-CNN with multi-task learning on the KITTI dataset. The model generates 2D and 3D results from a single vehicle image at the same time, which helps autonomous vehicle driving. The above two-stage algorithms have shown good results in object detection. However, because these models consist of a region proposal network and a classification network, they often require longer detection times, and their larger model sizes demand more advanced equipment.
The one-stage detection algorithm eliminates the RPN of the two-stage approach and directly predicts object categories and regresses bounding boxes, thereby reducing detection time. The most representative algorithm is You Only Look Once (YOLO). Redmon et al. [15] proposed YOLOV1 in 2016, using GoogleNet [16] as the backbone network and directly predicting bounding box coordinates and categories from the image. YOLOV2 was proposed in 2017 [17], adopting the anchor box mechanism from the RPN of Faster R-CNN and raising the input resolution from 224 × 224 to 448 × 448, which greatly improves detection accuracy. Its weakness is that results on small objects are not favorable. YOLOV3 [18] replaced the backbone with DarkNet-53, which borrows residual connections from ResNet [19], and added FPN to make predictions from features at different scales, improving the detection of small objects. Focusing on parameter quantity and accuracy, Bochkovskiy et al. [20] proposed YOLOV4 in 2020 by combining the DarkNet-53 backbone with CSPNet, proposed by Wang et al. [21]. The new backbone, CSP DarkNet-53, reduces computational complexity and memory cost, increases accuracy and allows model training and testing on conventional GPUs.
Many scholars have applied the YOLO series in various fields. For example, based on the YOLOV4-Tiny model, Lin et al. [22] proposed a method using K-median clustering to find appropriate anchor boxes for end-face images of bundled logs. The proposed model used three prediction heads and connected each head with SPP to extract small targets. The Precision, Recall and F1-Score were 93.97, 94.91 and 95.00%, respectively. Kumar et al. [23] modified the YOLOV4-Tiny model and proposed ETL-YOLOV4 for mask detection by modifying the backbone network, adding a dense SPP network, using Mish as the activation function and applying Mosaic and CutMix augmentation to improve training performance. The mAPs of ETL-YOLOV4 on the FMD and MOXA open datasets reached 67.64 and 65.14%, respectively, outperforming the YOLOV3 and YOLOV4-Tiny models. Wang et al. [24] proposed the DSE-YOLO model, which uses pointwise convolution and dilated convolution and adds exponentially enhanced binary cross-entropy (EBCE) and double enhanced mean squared error (DEMSE) loss functions to detect smaller fruits and accurately distinguish different growth stages. On multi-stage strawberry fruit images, the mAP, F1-Score and parameter size were 86.58%, 81.59% and 224.39 MB, respectively. Compared with Faster R-CNN, SSD300, SSD512, YOLOV3, YOLOV4 and YOLOV5, DSE-YOLO achieves a balance between accuracy and number of parameters.
Su et al. [25] proposed the YOLO-LOGO model for breast cancer detection by combining YOLOV5-L6 and the Local-Global (LOGO) method. In the model, YOLOV5-L6 locates the tumor in the breast cancer image, and LOGO then segments it. The F1-Score and IoU were 74.52 and 69.37% on the CBIS-DDSM dataset and 69.37 and 61.09% on the INBreast dataset. Wu et al. [26] proposed the FMD-YOLO model for mask detection, combining Res2Net and Im-Res2Net-101 to extract features. The FMD-YOLO model achieved APs of 92.00 and 88.40% on two open datasets and outperformed eight object detection models, including Faster R-CNN, Faster R-CNN with FPN, YOLOV3, YOLOV4, RetinaNet, FCOS, EfficientDet and HRNet. Wang et al. [27] proposed the LDS-YOLO model in 2022 to identify dead trees in images taken by drones, confirming the areas of dead trees so that new trees can be replanted in time. The LDS-YOLO model introduced SPP to improve the detection of smaller targets in UAV images, and because the model is intended to be combined with UAVs, depthwise separable convolution was used to reduce the model size to 7.60 MB. A similar application was found in Zhao et al. [28], where depthwise separable convolution reduced the model size to 11 MB while maintaining an accuracy of 95.47% on real-time detection of abnormal fish behavior.
Existing YOLO models consider the trade-off between detection speed and accuracy for real-time detection. Some of them apply SPP, add SE or use hybrid models to enhance detection accuracy. There are many YOLO applications in fruit detection. For example, Tian et al. [29] combined DenseNet with YOLOV3 to effectively detect the young, growing and mature stages of apples. Mirhaji et al. [30] used transfer learning on YOLOV2, YOLOV3 and YOLOV4 to detect oranges under different lighting conditions and used regression analysis to predict the number of oranges. In addition, YOLO-Tomato models detected tomatoes in complex environments [31], and modified YOLOV4 models detected fruit diseases under challenging conditions [32,33]. To be more applicable to real-time fruit detection, the model size is expected to be small. A modified DenseNet-fused YOLOV4 efficiently detected the growth stages of mango in a complex environment [34]. To accurately identify grapes in complex backgrounds, Li et al. [35] modified the YOLOV4-Tiny model and added an attention module (Squeeze-and-Excitation) to improve the detection of hidden grapes; furthermore, a depthwise separable convolution module was used to reduce the number of parameters and improve real-time performance. Li et al. [36] modified the YOLOV4-Tiny model in 2021 to detect green peppers in complex backgrounds, adding multi-scale Adaptive Spatial Feature Fusion (ASFF) to enhance the detection of small-scale green peppers. However, using low-dimensional feature maps to enrich the feature information of small targets increases background noise and degrades detection accuracy, so a channel attention module, the Convolutional Block Attention Module (CBAM), was introduced to mitigate this problem.
Faster R-CNN consists of an RPN and a classification network, which prolongs detection time and prevents real-time detection for high-resolution fruit images. YOLO eliminates the RPN and efficiently detects object categories while regressing bounding boxes. Based on the fast detection speed of YOLO, which facilitates real-time detection, this study chooses and modifies the one-stage detection model YOLO for detecting fruit images at different growth stages.
The current study focuses on efficient and effective detection of fruit growth stages. Based on YOLOV4-Tiny, this study proposes a one-stage detection model, GCS-YOLOV4-Tiny, to examine fruits of different sizes in complex environments, for example, fruits occluded by leaves or branches, small fruits, fruits in poor light or multiple fruits in one image. This study modifies the backbone network CSP DarkNet-53-Tiny of YOLOV4-Tiny. The proposed GCS-YOLOV4-Tiny model uses DIOU-Non-Maximum Suppression (DIOU-NMS) and a k-means clustering algorithm to find suitable anchor boxes for detection. In addition, the proposed model adds SE between CBL blocks to combine features of different resolutions and improve performance. To enhance feature diversity, two SPP modules, which avoid the distortion caused by image scaling and fuse local and global features, are added before the fully connected layer. Furthermore, the proposed model selects group convolution, which requires fewer computational resources, to greatly reduce the model size. With the smallest model size of 20.70 MB, the detection results outperform the state-of-the-art YOLOV4-Tiny model with a 17.45% increase in mAP and a 13.80% increase in F1-Score. The proposed model detects different growth stages of fruits effectively and efficiently, which is beneficial for real-time detection.
The rest of the paper is organized as follows. Section 2 reviews related studies on SE, SPP, group convolution and YOLOV4-Tiny. Section 3 introduces the datasets and the details of the proposed GCS-YOLOV4-Tiny model. Section 4 presents the experimental results on three open datasets. Finally, Section 5 summarizes the proposed model and suggests future research.
2. Related works
2.1. YOLOV4-Tiny
In 2020, Wang et al. [37] proposed the one-stage YOLOV4-Tiny model, a simplified version of YOLOV4 that maintains detection accuracy with a faster detection speed. Its main advantage is that it uses fewer parameters, supports real-time detection and can be integrated with embedded devices. Several scholars have used the YOLOV4-Tiny model in different fields. Zhang et al. [38] used drones to capture images of ripe strawberries, immature strawberries and flowers. Based on the YOLOV4-Tiny model, their proposed RTSD-Net has fewer convolutional layers and a faster detection speed, which benefits the development of robotic strawberry harvesting. Yao et al. [39] proposed a modified YOLOV4-Tiny model for real-time traffic sign detection. Using the backbone network CSP-DarkNet-53-Tiny of YOLOV4-Tiny, their model outputs two feature layers of different scales to detect smaller objects.
In [40], the ECA-Net model combines the main feature extraction of YOLOV4-Tiny with an attention mechanism to detect insulators in aerial images of high-voltage transmission lines. The ECA-Net model improves the accuracy from 81.01 to 91.19% with a model size of 24.90 MB, which is suitable for embedded devices and reduces the working time of line inspection personnel. Zhang et al. [41] used a modified YOLOV4-Tiny model in 2021 to identify the regions where the dials and indicators are located in water meter images. The output prediction network of YOLOV4-Tiny was extended to feature maps of three different scales, and the improved YOLOV4-Tiny detects the dial area of the image with high confidence and identifies the type of dial correctly. Li et al. [35] added an SE module to the YOLOV4-Tiny model to improve performance on the detection of covered grapes; in addition, a depthwise separable convolution module is used to reduce the model size for real-time detection of grapes by robots. To detect green peppers in complex backgrounds, Li et al. [36] added an Adaptive Spatial Feature Fusion (ASFF) pyramid to YOLOV4-Tiny to detect small green peppers, achieving an accuracy as high as 96.91% with a model size of 30.90 MB.
2.2. SE
The SE module consists of squeeze and excitation operations [42]. In recent years, many scholars have used SE to improve the performance of their models and have demonstrated good results in various fields. Wang et al. [43] proposed the SAR-U-Net network in 2021 to segment liver CT images automatically, adding SE to each convolutional unit of the U-Net encoder to adaptively learn image features and suppress irrelevant regions in the segmentation task. Ma et al. [44] proposed the MaSE-ResNeXt model for rock slice image classification, adding a bottleneck block to the SE module to balance the nonlinear representation ability of the two fully connected layers and improve detection ability. The SE-ResNeXt model proposed by Khan et al. [45] in 2021 addresses the recognition of Bengali handwritten compound characters by fusing channel-spatial information and inter-channel dependencies through SE within local receptive fields. In 2022, the SECNN model [46] detected five diseases of pepper leaves with a small parameter size of 5.40 MB and achieved good results on the pepper leaf disease dataset. Huang et al. [47] proposed SESPNets, based on SE, for ship detection in optical remote sensing images. Alsarhan et al. [48] integrated SE modules into graph convolutional networks to obtain discriminative channel-wise features of the input feature matrix, highlighting the important features to enhance recognition accuracy.
2.3. SPP
SPP was proposed by He et al. [5] in 2014 to address the fixed input image size required by the two-stage object detection model R-CNN, which deforms some images through cropping or warping. The problem is solved by adding SPP after the convolutional layers, pooling the feature maps at multiple scales and concatenating them before the fully connected layer. Yee et al. [49] proposed the DeepScene model for scene classification, incorporating SPP into a CNN to enable multi-size training. Prasetyo et al. [50] proposed a wing convolutional layer to enhance feature diversity and modified SPP into Tiny-SPP to reduce computational resources for the detection of fish eyes, tails and bodies. The SPP-LSTM-NET combined SPP with an LSTM network to predict PM 2.5 concentration and achieved better results than the traditional LSTM [51].
2.4. Group convolution
Group convolution, first used in the AlexNet model in 2012, was applied to split the network for execution on two GPUs. The method groups the input feature maps and convolves each group separately; dividing a convolution into G groups reduces the parameter count of that layer to 1/G of the original. Scholars have combined or modified group convolution in different fields. For example, Li et al. [52] proposed a classification method based on Interleaved Group Convolutions (IGCs) in 2019 to classify crop image datasets captured by sensors. IGCs shorten the training time of the model without reducing classification accuracy, especially for training samples with long time series. Yang et al. [53] proposed a lightweight group convolutional network (LGCN) for single image super-resolution (SISR) in 2019; group convolution reduces the number of parameters of the LGCN and gradually gathers local information in SISR images. In addition, the enhanced super-resolution group CNN (ESRGCNN) uses a 6-layer group-augmented convolution block to enhance the representation of low-frequency features and improve the performance and speed of SISR [54].
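To make the 1/G parameter reduction concrete, the weight counts of a k × k convolution with C_in input channels and C_out output channels (bias terms ignored) can be written as follows; this is a standard calculation rather than a formula taken from the cited works:

\[
\text{params}_{\text{standard}} = C_{in} \times C_{out} \times k^{2}, \qquad
\text{params}_{\text{grouped}} = G \times \frac{C_{in}}{G} \times \frac{C_{out}}{G} \times k^{2} = \frac{C_{in} \times C_{out} \times k^{2}}{G}
\]

For example, with C_in = C_out = 256, k = 3 and G = 4, the layer shrinks from 589,824 to 147,456 weights.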
3. Materials and methods
3.1. Datasets
1) The Mango YOLO dataset was created by Koirala et al. [55]. This dataset has only one category of mango. The images were taken in a dark environment, and the image size is 512 × 512 pixels. The dataset contains 1730 images in total.
2) The Rpi-Tomato dataset contains tomato images of four different maturity levels (green, turning, light red and red) and includes 257 images in total [56].
3) The F. margarita dataset contains images of three different growth stages (mature, immature and growing) of F. margarita [57]. Some images include more than one stage. Images are classified into seven categories: (a) mature, (b) immature, (c) growing, (d) mature and immature, (e) mature and growing, (f) immature and growing and (g) mature, immature and growing. The original number of images is 1031, and data augmentation increases the total number of images to 6617.
Table 1 displays the numbers of images, and Figure 1 shows example images from the three datasets. Images (a1)–(a4) are from the Mango YOLO dataset; images (b1)–(b4) are examples of green, turning, light red and red tomatoes from the Rpi-Tomato dataset; images (c1)–(c7) show examples of F. margarita from the F. margarita dataset. Images in the Mango YOLO dataset exhibit occluded mangoes in a low-light environment, while the Rpi-Tomato and F. margarita datasets represent scenes with natural images in real environments. Images from these three datasets are used to test and evaluate the object detection performance of the proposed GCS-YOLOV4-Tiny model.
3.2. GCS-YOLOV4-Tiny
The main purpose of this research is to propose a lightweight object detection model for real-time detection with favorable detection accuracy. The proposed GCS-YOLOV4-Tiny model is tested and evaluated on three different public datasets. The following describes the proposed model in detail.
1) Based on the studies in Section 2.1, YOLOV4-Tiny provides good detection accuracy with a faster detection speed. The proposed GCS-YOLOV4-Tiny model uses DIOU-Non-Maximum Suppression (DIOU-NMS) [58] to replace NMS and uses a k-means clustering algorithm to find suitable anchor boxes for detection. This study modifies the backbone network CSP DarkNet-53-Tiny of YOLOV4-Tiny, as shown in Figure 3.
Non-maximum suppression (NMS) is commonly used in object detection models to remove the multiple prediction boxes that appear around a target. NMS removes redundant boxes using the Intersection over Union (IoU) metric. This study uses Distance-IoU (DIoU) [58], which considers both the overlap area and the distance between the central points of two bounding boxes.
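The DIoU penalty term, in its standard form from [58], is:

\[
R_{DIoU} = \frac{\rho^{2}\left(b,\, b^{gt}\right)}{c^{2}}
\]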
where R_DIoU is the penalty term between the predicted box and the target box, b and b^gt are the central points of the predicted box and the target box, ρ(·) is the Euclidean distance, c is the diagonal length of the smallest enclosing box covering the two boxes, and d is the distance between the central points of the two boxes. Figure 2 illustrates the boxes and distances.
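DIoU-NMS then keeps or suppresses each candidate box with the following rule, which is the common formulation in [58]:

\[
s_{i} =
\begin{cases}
s_{i}, & \mathrm{IoU} - R_{DIoU}(M, B_{i}) < \varepsilon \\
0, & \mathrm{IoU} - R_{DIoU}(M, B_{i}) \ge \varepsilon
\end{cases}
\]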
where s_i is the classification score, M is the predicted box with the highest classification score, B_i is a candidate box that is removed or kept by simultaneously considering the IoU and the distance between the central points of the two boxes, and ε is the NMS threshold.
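For the anchor boxes mentioned in item 1), a minimal sketch of k-means clustering on ground-truth box sizes is given below. It is only illustrative: plain Euclidean k-means on (width, height) pairs is used, whereas YOLO implementations often cluster with a 1 − IoU distance, and the anchor count k = 6 and all names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anchors(box_wh, k=6):
    """Cluster ground-truth (width, height) pairs into k anchor boxes.

    box_wh: array of shape (N, 2) holding the widths and heights of all
    labeled boxes, e.g., in pixels at the 608 x 608 network input size.
    Returns the k cluster centers sorted by box area (small to large).
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_wh)
    centers = km.cluster_centers_
    return centers[np.argsort(centers.prod(axis=1))]

# Example with random box sizes standing in for real annotations.
anchors = kmeans_anchors(np.random.rand(500, 2) * 608)
print(anchors.round(1))
```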
2) Based on the studies in Section 2.2, SE automatically learns image features and fuses the features of different channels in image classification, image segmentation and object detection. Figure 4 shows the architecture of squeeze, excitation and the SE module. The proposed GCS-YOLOV4-Tiny model adds SE between the CBL block with 304 × 304 × 32 and the CBL block with 152 × 152 × 64 to combine features of different resolutions, which improves the performance of the model.
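For reference, a minimal PyTorch-style sketch of a standard SE block [42] of the kind inserted between the CBL blocks is shown below; the reduction ratio of 16 and the class and variable names are illustrative assumptions rather than the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # global average pool per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck fully connected layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # scale each feature map

# Example: re-weight a 64-channel, 152 x 152 feature map.
out = SEBlock(64)(torch.randn(1, 64, 152, 152))
print(out.shape)  # torch.Size([1, 64, 152, 152])
```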
3) The SPP module avoids the distortion problem caused by image scaling and fuses local and global features. The proposed GCS-YOLOV4-Tiny model adds two SPP modules before the fully connected layer to enhance feature diversity. Figure 5 represents the architecture of the proposed SPP.
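A minimal sketch of a YOLO-style SPP module is given below; the pooling kernel sizes (5, 9, 13) are a common choice and an assumption here, not necessarily the exact configuration shown in Figure 5.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: fuse the input with max-pooled maps of several receptive fields."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 and padding k // 2 keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # concatenate along the channel axis to mix local and global context
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 256-channel map becomes 256 * 4 = 1024 channels.
print(SPP()(torch.randn(1, 256, 19, 19)).shape)  # torch.Size([1, 1024, 19, 19])
```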
4) To shorten training time, the proposed GCS-YOLOV4-Tiny model uses group convolution instead of traditional convolution. As shown on the left-hand side of Figure 6, the traditional convolution has 12 input feature maps and 6 output feature maps. With group convolution, the input and output feature maps are 12 and 3, respectively, as shown on the right-hand side of Figure 6. Group convolution requires fewer computational resources and greatly reduces the model size.
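The parameter saving can be checked with a short PyTorch snippet; the channel counts (12 inputs, 6 outputs) and the group number G = 3 are illustrative assumptions based loosely on Figure 6.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Traditional 3 x 3 convolution: every output map sees all 12 input maps.
conv_std = nn.Conv2d(12, 6, kernel_size=3, padding=1, groups=1)

# Group convolution with G = 3: each group maps 4 input maps to 2 output maps.
conv_grp = nn.Conv2d(12, 6, kernel_size=3, padding=1, groups=3)

print(n_params(conv_std), n_params(conv_grp))  # 654 vs 222; the weights shrink by a factor of G
```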
5) In summary, the proposed GCS-YOLOV4-Tiny model (a) selects YOLOV4-Tiny as the backbone, (b) uses DIOU-NMS, (c) adds two SE blocks, (d) adds two SPP modules and (e) replaces traditional convolution with group convolution. Figure 7 displays the architecture of the proposed GCS-YOLOV4-Tiny model.
3.3. Evaluation metrics
To evaluate the performance of the proposed GCS-YOLOV4-Tiny model, Eqs (3)–(8) illustrate the commonly used indices, including Precision, Recall, F1-Score, Average Precision (AP), mean Average Precision (mAP) and Intersection over Union (IoU).
Precision represents the proportion of positive class predictions that actually belong to the positive class.
Recall represents the proportion of all positive examples in the dataset that are correctly predicted as positive.
F1-Score is the harmonic mean of Precision and Recall.
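In their standard forms, these three indices are written as:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]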
TP (true positive) represents the number of positive samples that are correctly classified as positive; FP (false positive) represents the number of negative samples that are incorrectly classified as positive; FN (false negative) refers to the number of positive samples that are incorrectly classified as negative.
Average Precision (AP) is the area under the precision-recall curve, and it represents the detection accuracy for a single category.
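In its usual form, AP integrates precision P over recall R:

\[
AP = \int_{0}^{1} P(R)\, dR
\]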
mAP represents the average of the AP values over all categories.
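A common way to write mAP, averaging the per-category AP values (computed over the M test images) across the K categories, is:

\[
mAP = \frac{1}{K} \sum_{k=1}^{K} AP_{k}
\]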
where M is the number of images, and K is the number of categories.
Intersection over Union (IoU) is the Area of Overlap divided by the Area of Union.
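In symbols:

\[
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
\]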
where the Area of Overlap (gray box in Figure 9) is the intersection between the predicted bounding box (red box in Figure 8) and the ground-truth bounding box (blue box in Figure 8), and the Area of Union (gray box in Figure 10) is the area encompassed by both the predicted bounding box and the ground-truth bounding box.
4. Experiments and results
Three datasets, Mango YOLO, Rpi-Tomato and F. margarita, were used in this study. The input image size was adjusted to 608 × 608 for GCS-YOLOV4-Tiny, with a batch size of 64, an epoch count of 6000, a learning rate of 0.001 and a decay rate of 0.0005. The equipment used in the experiments was an Intel(R) Core(TM) i7-8700 @ 3.20 GHz CPU and an NVIDIA GeForce RTX 2080 GPU. All experiments were performed using Python 3.8 [Python Software Foundation, Fredericksburg, Virginia, USA].
4.1. Ablation experiment
To compare the effects of using DIOU-NMS, adding SE, adding SPP and using group convolution in YOLOV4-Tiny on the F. margarita dataset, this study conducted sixteen experiments and evaluated the performance of the proposed GCS-YOLOV4-Tiny model, as shown in Table 2. The APs for the mature, immature and growing groups were 77.97, 87.25 and 63.98%, respectively, in the original YOLOV4-Tiny model (experiment #1). The lowest AP occurs in the growing group in each experiment. The APs for the mature and immature groups were above 90% for all experiments except experiment #1. The mAP was 76.40% in the original YOLOV4-Tiny model (experiment #1), and it increased to 85.26% when using DIOU-NMS (experiment #2). The mAP values of most of the experiments that added SPP (experiments #4, #9, #11, #12, #14, #15 and #16) were greater than 90%. The highest mAP was 93.54%, obtained when using DIOU-NMS and group convolution and adding SE and SPP to YOLOV4-Tiny (experiment #16). It is worth noting that the AP of the growing group was greatly enhanced from 63.98 to 87.69%, which makes a substantial contribution to the mAP.
Figure 11 presents the results of the ablation study, including the APs for the mature, immature and growing groups, the mAP and the model size for each experiment. The minimum model size, 18.20 MB, comes from experiment #5, which is the original YOLOV4-Tiny model using group convolution; unfortunately, its mAP is not favorable. The model size for experiment #16, which has the highest mAP, is 20.70 MB. Therefore, experiment #16 is the final architecture of the proposed GCS-YOLOV4-Tiny model.
4.2. Results on Mango YOLO
The proposed GCS-YOLOV4-Tiny model was first evaluated using the Mango YOLO dataset, which has only one category. The performance values of GCS-YOLOV4-Tiny, including AP, Recall, F1-Score, Precision and Average IoU, are 91.91, 79.00, 88.00, 98.00 and 81.10%, respectively. Table 3 compares the performances of related studies using the same dataset.
The proposed GCS-YOLOV4-Tiny achieves the highest AP, while YOLOV4-Tiny achieves the best Precision. Figure 12 displays detection results from the proposed GCS-YOLOV4-Tiny model. The proposed model detects most of the mangoes, even though some of them are covered by leaves.
4.3. Results on Rpi-Tomato
The second dataset used in this study is Rpi-Tomato, with tomato images of four different maturity levels (green, turning, light red and red). Related studies include Moreira et al. [56], which applied SSD MobileNet v2, YOLOV4 and HSV Color Space. This study executed YOLOV4-Tiny and the proposed GCS-YOLOV4-Tiny, and the results are shown in Table 4. Among the four classes, the lowest AP occurred in the "Turning" class, while the highest AP came from the "Light Red" class for both the YOLOV4-Tiny and GCS-YOLOV4-Tiny models. The YOLOV4-Tiny and GCS-YOLOV4-Tiny models outperform SSD MobileNet v2, YOLOV4 and HSV Color Space on most of the performance indices.
The GCS-YOLOV4-Tiny results are similar to the YOLOV4-Tiny results on the Rpi-Tomato dataset. Figure 13 displays detection results from the proposed GCS-YOLOV4-Tiny model for the four classes in the Rpi-Tomato dataset, with Figure 13 (b1)–(b4) presenting the green, turning, light red and red classes, respectively.
4.4. Results on F. margarita
YOLOV3, YOLOV3-Tiny, YOLOV4, YOLOV4-Tiny and GCS-YOLOV4-Tiny models were tested using the F. margarita dataset [57] with five-fold cross validation. Results from the proposed GCS-YOLOV4-Tiny are shown in Table 5. The AP values for mature, immature and growing were 98.34 ± 0.75, 93.72 ± 1.53 and 87.80 ± 2.11%, respectively, and the mAP was 93.42 ± 0.44%. The Recall, F1-score, Precision, and Average IoU of GCS-YOLOV4-Tiny were 91.00 ± 1.87, 90.80 ± 2.59, 90.80 ± 2.77 and 76.94 ± 1.35%, respectively.
Table 6 records the training time for each fold in the GCS-YOLOV4-Tiny model. The average training time per fold is 3 hr 34 min.
Table 7 compares the performance metrics of the YOLOV3, YOLOV3-Tiny, YOLOV4, YOLOV4-Tiny and proposed GCS-YOLOV4-Tiny models. ANOVA was used to evaluate the differences among the models. The proposed GCS-YOLOV4-Tiny model scores the highest mAP of 93.42%, and the same holds for mature AP, immature AP, growing AP, Recall, F1-Score, Precision and Average IoU. Notably, the proposed GCS-YOLOV4-Tiny model largely improved the AP of the growing group to 87.80%, whereas the values were around 60–70% in the other four models.
Figure 14 uses box plots to illustrate the performances of the above five models. Obviously, the proposed GCS-YOLOV4-Tiny model achieves higher averages with smaller standard deviations and dominates the other four models in all metrics.
In addition to the performance metrics in Table 7, this study compares the average training times, Billion Float Operations (BFLOPs) and model sizes of the five models in Table 8. YOLOV4 has the longest average training time of 14 hr 26 min, the highest BFLOPs of 127.26 and the largest model size of 244 MB. The average training time of GCS-YOLOV4-Tiny is 3 hr 35 min, which is slightly longer than that of YOLOV4-Tiny; a similar situation occurs for BFLOPs. Although YOLOV4-Tiny has a shorter average training time and fewer BFLOPs than GCS-YOLOV4-Tiny, the model size of GCS-YOLOV4-Tiny is 20.70 MB, which is smaller than that of YOLOV4-Tiny (22.40 MB). Figure 15 plots the mAP values and model sizes of the five models. Clearly, the proposed model achieves the highest mAP of 93.42% with the smallest model size of 20.70 MB. Figures 16–20 show detection examples on the F. margarita dataset by the YOLOV3, YOLOV3-Tiny, YOLOV4, YOLOV4-Tiny and proposed GCS-YOLOV4-Tiny models.
5. Conclusions
Based on YOLOV4-Tiny, this study proposes a one-stage detection model, GCS-YOLOV4-Tiny, to examine fruits of different sizes in complex environments. The proposed GCS-YOLOV4-Tiny model uses DIOU-NMS and adds SE and SPP modules. Furthermore, group convolution was applied to greatly reduce the model size. With the smallest model size of 20.70 MB, the detection results outperform the state-of-the-art YOLOV4-Tiny model with a 17.45% increase in mAP and a 13.80% increase in F1-Score on the F. margarita dataset. The proposed model detects different growth stages of fruits effectively and efficiently, which is beneficial for real-time detection.
This study mainly selects the one-stage YOLO detection algorithm and focuses on constructing a lightweight network to perform real-time detection of fruit growth stages. Other detectors such as R-CNN, SSD and Mask R-CNN were not compared in this study. Furthermore, due to hardware limitations, this research did not use the latest YOLO version. Although the performance of the proposed model is favorable, it is possible to modify the architecture of the proposed model or to use the latest network version to achieve better performance in the future.
Acknowledgments
The research was partially funded by the National Science and Technology Council of Taiwan, R.O.C. (Research Grant Project number MOST 111-2221-E-167-007-MY3).
Conflict of interest
The authors declare there is no conflict of interest.
Ethics Statement
This study did not conduct experiments involving humans and animals.