Research article

A cross-modal conditional mechanism based on attention for text-video retrieval


  • Received: 14 August 2023 Revised: 10 October 2023 Accepted: 17 October 2023 Published: 03 November 2023
  • Current research in cross-modal retrieval has primarily focused on aligning the global features of videos and sentences. However, video conveys a much more comprehensive range of information than text. Thus, text-video matching should focus on the similarities between frames containing critical information and text semantics. This paper proposes a cross-modal conditional feature aggregation model based on the attention mechanism. It includes two innovative modules: (1) A cross-modal attentional feature aggregation module, which uses the semantic text features as conditional projections to extract the most relevant features from the video frames. It aggregates these frame features to form global video features. (2) A global-local similarity calculation module calculates similarities at two granularities (video-sentence and frame-word features) to consider both the topic and detail features in the text-video matching process. Our experiments on the four widely used MSR-VTT, LSMDC, MSVD and DiDeMo datasets demonstrate the effectiveness of our model and its superiority over state-of-the-art methods. The results show that the cross-modal attention aggregation approach can effectively capture the primary semantic information of the video. At the same time, the global-local similarity calculation model can accurately match text and video based on topic and detail features.

    Citation: Wanru Du, Xiaochuan Jing, Quan Zhu, Xiaoyin Wang, Xuan Liu. A cross-modal conditional mechanism based on attention for text-video retrieval[J]. Mathematical Biosciences and Engineering, 2023, 20(11): 20073-20092. doi: 10.3934/mbe.2023889




    High-accuracy image recognition tasks in specific scenarios have always been a prominent research topic in computer vision [1,2,3,4]. Railway stations, banks, airports and border controls are critical systems where deep learning technology has been widely applied [5,6]. On the one hand, the extensive use of deep learning in these scenarios has significantly reduced human and material resources. On the other hand, the uniqueness of these scenarios imposes a high demand for accuracy in deep learning techniques. In particular, security systems necessitate real-time and efficient tracking, management and protection of valuable materials. In this context, cargo recognition plays a critical role. It enables automated and efficient management, enhancing both work efficiency and safety. Therefore, achieving high accuracy and reliability in cargo recognition within high-security systems has emerged as a pressing concern in the field of security technology.

    Traditional methods for cargo identification require manual intervention to obtain additional information about the cargo [7,8,9]. Furthermore, the recognition results of traditional methods tend to be suboptimal in complex cargo environments. In contrast, deep learning-based cargo recognition methods exhibit stable performance in intricate environments and can be readily deployed in practical production settings. By employing neural network models and corresponding training techniques, cargo recognition can be automatically achieved in diverse environments, regardless of artificial or other external factors.

    Zhu et al. [10] proposed an attribute-guided two-layer learning framework that can identify unknown image categories, thus improving the robustness and performance of few-shot image recognition. Zeng et al. [11] proposed a convolutional neural network model for classifying house styles and achieved reasonable classification results on a small sample dataset, which confirmed the feasibility of house style recognition. Yi et al. [12] proposed an end-to-end trained superpixel convolutional neural network that treats irregular superpixels as 2D point clouds and uses PointConv layers instead of standard convolutional layers to process them, thereby learning high-level representations of image superpixels and improving superpixel efficiency while obtaining considerable image recognition results. The aforementioned methods achieve satisfactory recognition results in their respective scenarios, but they ignore the fine-grained features of the recognition objects and focus only on global, coarse-grained features.

    Koyun et al. [13] proposed a two-stage object detection framework named "Focus-and-Detect" for detecting small objects in aerial images and introduced the Incomplete Box Suppression (IBS) method to address the truncation effect of region search methods. This framework demonstrated the best small-object detection performance on the VisDrone validation dataset. Wang et al. [14] presented a small object detection method based on an enhanced Single Shot MultiBox Detector (SSD) algorithm. The method replaced the original VGG-16 with an improved dense convolutional network (C-DenseNet) and incorporated residual prediction layers and DIoU-NMS, effectively alleviating the false-detection and missed-detection issues that object detection algorithms face on small objects. Dong et al. [15] proposed a new object detection method based on a feature pyramid network (FPN), which introduces a multi-scale deformable attention module (MSDAM) and a multi-level feature aggregation module (MLFAM) to enhance remote sensing object detection (RSOD), achieving accurate detection on the optical remote sensing (DIOR) and RSOD datasets. These object detection methods achieve effective object localization and recognition, but at the cost of a large amount of manual annotation, and their attention is focused mainly on the local, fine-grained features of the objects.

    Addressing the aforementioned issues, we present AGMG-Net to enhance the accuracy of cargo recognition in security scenarios. The proposed network can effectively capture the distribution of focused regions, accurately locate the target position without manual annotation, separate the target from the background and fuse multi-granularity features to achieve precise identification of cargo. The major contributions of this study are outlined as follows:

    ● We propose the AMAA method to address the difficulty of locating targets in complex security system environments.

    ● We also propose a multi-region confidence-dependent optimal selection method to reduce the dependency on the threshold of foreground-background segmentation.

    ● Building on these two methods, we present an attention-guided multi-granularity feature fusion network that effectively enhances the accuracy of cargo recognition in security systems.

    This section provides a brief review of the most relevant work, encompassing multi-branch models, weakly supervised object localization (WSOL) and fine-grained visual classification.

    Multi-branch networks, as a fundamental structure in deep learning, have found wide application in various task domains including semantic segmentation [16,17] and object detection [18,19,20], enabling the capture and learning of richer and more diverse features. For instance, Xie et al. [21] proposed a multi-branch network for disease detection in retinal images, enhancing the representation of disease-specific features through the fusion of multi-scale and spatial features. To overcome the limited capacity for extracting global spatial information, Xu et al. [22] introduced a dual-branch network composed of a grouped bidirectional LSTM (GBiLSTM) network and a multi-level fusion convolutional transformer (MFCT), generating distinct and robust spectral-spatial features for hyperspectral image classification with limited labeled samples. Addressing the challenge of small object detection in aerial images with limited samples, Zhang et al. [23] proposed a multi-branch network incorporating a transformer branch, leveraging the strengths of generative models and transformer networks to improve the robustness of small object detection in complex environments. To address the significant non-linear differences between image blocks in image matching, Yu et al. [24] presented a composite metric network comprising a main metric network module and multiple branch metric network modules to capture richer and more distinctive feature differences. Overall, the concept of multi-branch networks has emerged as a crucial research direction in deep learning, offering practical solutions for diverse task domains.

    Since Zhou et al. [25] introduced the use of Class Activation Maps (CAM) to characterize object locations, an increasing number of works have applied them to the field of WSOL [26,27,28,29,30,31]. A CAM is obtained by weighting the feature maps that feed the global average pooling layer of a classification network with the class weights of the subsequent softmax classifier, thereby emphasizing the position of the target object. Hwang et al. [32] introduced a target localization strategy using entropy regularization, which considers the one-hot labels and the entropy of predicted probabilities, thereby striking a balance between WSOL scores and classification performance. Zhang and Yang [33] proposed an adaptive attention enhancer to address the limitation of existing WSOL methods that lack modeling of the correlation between different regions of the target object. This enhancer supplements object attention by discovering the semantic correspondence between different regions. Gao et al. [34] presented a token semantics coupled attention map (TS-CAM) to tackle the challenge of learning object localization models given only image category labels. The self-attention mechanism in vision transformers is utilized to extract long-range dependencies and compensate for the tendency of Convolutional Neural Networks (CNNs) toward partial activations. These works demonstrate that WSOL can accomplish object localization tasks with only image-level annotations, eliminating the need for manual annotation in object detection tasks. Moreover, they hold significant value in image recognition: a network can quickly and accurately localize the recognized subject through WSOL methods, while fine-grained features of the image are extracted through the CG-Net branch.

    The approaches closest to this paper are [35,36]. Wang et al. [35] proposed a semantic-guided discriminative region localization method for fine-grained image recognition, addressing methods that ignore the spatial correspondence between low-level details and high-level semantics. Du et al. [36] introduced a novel framework for fine-grained visual classification, addressing the challenges of identifying discriminative granularities and fusing information. The framework includes a progressive training strategy and a jigsaw puzzle generator, achieving state-of-the-art performance on benchmark datasets. The difference between this paper and [35,36] is that this paper introduces CAM, a classic practice in WSOL, to initially localize the target, and accomplishes richer feature extraction through operations such as accumulation and coarse- and fine-grained feature fusion, to adequately learn the multi-granularity feature information in the image. Wang et al. [37] proposed the Prompting vision-Language Evaluator (PLEor), a novel framework for open set fine-grained retrieval based on the Contrastive Language-Image Pretraining (CLIP) model. PLEor leverages the pre-trained CLIP model to infer category-specific discrepancies and transfer them to the backbone network trained in close set scenarios. Wang et al. [38] introduced Fine-grained Retrieval Prompt Tuning (FRPT), which, by utilizing sample prompting and feature adaptation, achieves state-of-the-art performance on fine-grained datasets with fewer parameters.

    This paper introduces the AGMG-Net, which consists of the Fine-Grained Net (FG-Net), the Coarse-Grained Net (CG-Net) and the Multi-Granularity Fusion Net (MGF-Net). Cargo image information is complex and exhibits both coarse-grained and fine-grained features. Conventional deep learning approaches primarily focus on learning and extracting coarse-grained features, which limits their ability to capture fine-grained features and leads to inaccuracies and omissions in cargo recognition. To address these challenges, we propose the AGMG-Net. The FG-Net uses global attention to extract features and learn coarse-grained characteristics, such as color, shape and position. The CG-Net employs deep convolution to extract fine-grained features, including texture and structure. The MGF-Net conducts feature fusion and learning on multi-granularity features. The final cargo recognition result is obtained by applying a majority rule after classification using fully connected layers. Figure 1 illustrates the network structure.

    Figure 1.  The structure of AGMG-Net, the dotted frame above contains the structure of FG-Net, CG-Net and MGF-Net, and below is the AMAA module and MOSBC module.

    The input image undergoes initial processing by the FG-Net module to extract coarse-grained features. The FG-Net module comprises three components: a feature extractor, a classifier and an AMAA module. The feature extractor consists of convolutional and global self-attention blocks, which enable the learning of global features and the extraction of coarse-grained features. The classifier generates the cargo classification result using a global average pooling layer and a fully connected layer. During the early-to-mid training phase, the AMAA module accumulates the multi-stage attention map, directing the CG-Net to focus on the target object for extracting fine-grained features. As training progresses, the AMAA module guides the CG-Net to pay more attention to the overall image, to a certain extent, facilitating effective fusion of the cargo's coarse-grained features.
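    To illustrate the composition just described, the following PyTorch sketch mirrors an FG-Net-style branch: a convolutional stem followed by a global self-attention block, a classifier built from global average pooling and a fully connected layer, and a CAM-style attention map handed on to the AMAA module. All layer widths, the number of attention heads and the num_classes value are illustrative assumptions rather than the authors' configuration.

    import torch
    import torch.nn as nn

    class FGNetSketch(nn.Module):
        """Minimal sketch of the coarse-grained branch (all sizes are assumptions)."""
        def __init__(self, num_classes: int = 3, dim: int = 256):
            super().__init__()
            # Convolutional stem: extracts coarse feature maps from the input image.
            self.stem = nn.Sequential(
                nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
                nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
                nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            )
            # Global self-attention block: every spatial position attends to all others.
            self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
            # Classifier: global average pooling followed by a fully connected layer.
            self.fc = nn.Linear(dim, num_classes)

        def forward(self, x):
            f = self.stem(x)                        # (B, C, h, w) coarse feature map
            b, c, h, w = f.shape
            tokens = f.flatten(2).transpose(1, 2)   # (B, h*w, C)
            tokens, _ = self.attn(tokens, tokens, tokens)
            f = tokens.transpose(1, 2).reshape(b, c, h, w)
            logits = self.fc(f.mean(dim=(2, 3)))    # GAP + FC classifier
            # CAM-style attention map: weight the feature channels with the FC weights
            # of the predicted class; this is the map accumulated by the AMAA module.
            cls = logits.argmax(dim=1)
            cam = torch.einsum('bchw,bc->bhw', f, self.fc.weight[cls])
            return logits, cam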

    Suppose the original image is as shown in Figure 2(a). The AMAA module generates an attention map using the Class Activation Mapping (CAM) method, as illustrated in Figure 2(b). Let $I \in \mathbb{R}^{C \times W \times H}$ denote the input. The attention map acquired at iteration $t$ is denoted as $A^t \in \mathbb{R}^{w \times h}$. By setting the initial condition as $M_c^0 = A_c^0$, the cumulative attention map $M_c$ can be calculated using Eq (3.1).

    $M_c^t = \max\left(M_c^{t-1}, A_c^t\right), \quad t = 1, 2, \ldots$ (3.1)
    Figure 2.  Attention Map in the training stage of Cargo dataset.

    Here, $C$ represents the number of channels, $W$ and $H$ represent the width and height of the image, $w$ and $h$ represent the width and height of the attention map, and $c$ stands for the category.

    Different positions are emphasized at each stage of the network during training, and the resulting cumulative attention map $M_c^t$ reflects the attention region distribution for a given category. In the early and middle stages of training, $M_c^t$ exhibits more accurate localization ability than the attention map $A_c^t$. However, due to the gradual accumulation of maximum values during the calculation of the attention map $M$, regions with excessively high attention values can lead to inaccurate target localization. To address this issue, we propose the AMAA module in this paper. The AMAA module retains the attention maps from the previous $k-1$ stages along with the cumulative attention map $M^t$, forming an attention sequence denoted as $M_L$, which has a length of $k$. The specific formulation of $M_L$ is presented in Eq (3.2). By utilizing the attention sequence, which comprehensively considers the cumulative attention map $M$ and the recent $k-1$ attention maps, and performing a weighted summation of $M_L$ using Eq (3.3), we can prevent abnormal attention maps from misleading the final localization.

    $M_L = \left[M^t, A^t, A^{t-1}, \ldots, A^{t-k+1}\right]$ (3.2)
    $A = \sum_{i=0}^{k} \alpha_i M_{L_i}$ (3.3)

    where $\alpha \in \mathbb{R}^k$. Since taking a simple average of the attention sequence $M_L$ would diminish the impact of the most recent attention maps on the overall attention map $A$, this paper adopts the approximate forgetting function [39], $L(x, k)$, to compute the initial value of $\alpha$. Subsequently, $\alpha$ is normalized using Eq (3.5).

    $L(x, k) = \dfrac{1.84k}{1.25x + 1.84k}$ (3.4)
    $\alpha_i = \mathrm{softmax}\left(L(i, k)\right)$ (3.5)

    The AMAA module not only considers the cumulative attention map but also incorporates the attention maps computed in the previous $k-2$ iterations to generate the comprehensive attention map $A$ for the current iteration. As the model undergoes continuous training, AMAA gradually converges towards the actual distribution of the target position, i.e., the true location of the target object. This enables AMAA to efficiently accomplish target localization and thus guide CG-Net in performing localization and recognition.
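    To make Eqs (3.1)-(3.5) concrete, the following Python sketch implements the accumulation and weighting steps as described above. It is a minimal illustration, not the authors' implementation: the exact constants of the reconstructed forgetting function, the handling of the first few iterations (when fewer than $k-1$ attention maps are available) and the indexing of the attention sequence are assumptions.

    import numpy as np

    def forgetting_weights(k: int) -> np.ndarray:
        # Initial weights from the approximate forgetting function L(x, k) of Eq (3.4),
        # normalized with a softmax as in Eq (3.5). Index 0 corresponds to the
        # cumulative map M^t and receives the largest weight.
        x = np.arange(k, dtype=np.float64)
        L = 1.84 * k / (1.25 * x + 1.84 * k)
        e = np.exp(L - L.max())
        return e / e.sum()

    class AMAA:
        # Sketch of the attention-map accumulation of Eqs (3.1)-(3.3).
        def __init__(self, k: int = 8):
            self.k = k
            self.M = None          # cumulative attention map M^t
            self.recent = []       # most recent per-iteration attention maps A^t, A^(t-1), ...
            self.alpha = forgetting_weights(k)

        def update(self, A_t: np.ndarray) -> np.ndarray:
            # Eq (3.1): element-wise maximum accumulation with M^0 = A^0.
            self.M = A_t if self.M is None else np.maximum(self.M, A_t)
            # Keep the k-1 most recent attention maps (an assumed convention).
            self.recent = [A_t] + self.recent[: self.k - 2]
            # Eq (3.2): attention sequence [M^t, A^t, A^(t-1), ...], padded with M^t
            # while fewer than k entries exist (assumption for early iterations).
            seq = [self.M] + self.recent
            seq = seq + [self.M] * (self.k - len(seq))
            # Eq (3.3): weighted sum yielding the comprehensive attention map A.
            return sum(w * m for w, m in zip(self.alpha, seq))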

    The comprehensive attention map generated by FG-Net and the original input are both fed into CG-Net for fine-grained feature extraction. CG-Net consists of three components: the MOSBC module, the feature extractor and the classifier. As shown in Figure 3, the MOSBC module first performs image fusion on the input and then selects the target region. The feature extractor has 11 convolutional layers, which enable the capture of detailed information and local features in the image, facilitating the extraction of fine-grained image features. The classifier consists of one global average pooling layer and two fully connected layers, enabling precise image classification.

    Figure 3.  Detailed structure of CG-Net.

    The comprehensive attention map $A$, as illustrated in Figure 2(c), still includes non-target areas and exhibits blurred edges within the target region, despite attention accumulation. To address this issue, this paper proposes a region localization method called MOSBC, which aims to identify the most probable area where the target is located. Initially, two-dimensional bilinear interpolation is employed to transform the comprehensive attention map into an output with the same width and height as the input, denoted as $A \in \mathbb{R}^{H \times W}$. Subsequently, a threshold $\gamma \in (0, 1)$ is utilized to delineate the foreground and background in the attention map $A$, as depicted in Eq (3.6).

    $A = \begin{cases} 0, & A \le \gamma \\ 1, & \text{otherwise} \end{cases}$ (3.6)

    After dividing the foreground and background, the Euclidean distance from each foreground position to the nearest background position is first calculated. The local peaks of the distance map are then found, and the connected components around these peaks are analyzed using eight-connectivity. The watershed algorithm [40] is then used for image segmentation to obtain the target region set, generating $N$ candidate regions $R_n = (x_n, y_n, h_n, w_n)$, $n = 1, \ldots, N$, where $(x_n, y_n)$ is the center coordinate of the candidate region $R_n$ and $h_n$, $w_n$ are its height and width. Then, Eq (3.7) is used to calculate the confidence score $\rho_n$ for each region and Eq (3.8) is used to select the region with the highest confidence score as the final target $Region$.

    $\rho_n = \dfrac{1}{h_n \times w_n} \sum_{i=0}^{h_n} \sum_{j=0}^{w_n} R_n(i, j)$ (3.7)
    $Region = \arg\max\left\{R_i \mid \rho_i = \max\{\rho_j\}, \; 1 \le i, j \le N\right\}$ (3.8)

    It is important to note that the $Region$ obtained at this stage is a localization box, which must be overlaid onto the input $I$ so that the fine-grained image of the target can be extracted by cropping. Subsequently, the fine-grained image is resized using two-dimensional bilinear interpolation to match the size of $I$ and then fed into the feature extractor of CG-Net for fine-grained feature extraction.
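    The MOSBC steps described above can be sketched with standard SciPy/scikit-image routines. This is a hedged illustration under several assumptions: the confidence of Eq (3.7) is computed here as the mean attention inside each candidate bounding box, the marker selection and the fallback when no candidate is found are illustrative choices, and the final bilinear resize of the crop back to the input size is omitted.

    import numpy as np
    from scipy import ndimage
    from skimage.feature import peak_local_max
    from skimage.measure import label, regionprops
    from skimage.segmentation import watershed

    def mosbc_region(A: np.ndarray, image: np.ndarray, gamma: float = 0.4):
        # Sketch of MOSBC: threshold the comprehensive attention map (Eq 3.6), split
        # the foreground with a distance transform and the watershed algorithm, then
        # keep the candidate region with the highest confidence (Eqs 3.7 and 3.8).
        H, W = image.shape[:2]
        # Resize the attention map to the input resolution (bilinear interpolation).
        A = ndimage.zoom(A, (H / A.shape[0], W / A.shape[1]), order=1)
        A = (A - A.min()) / (A.max() - A.min() + 1e-8)
        fg = A > gamma                                   # Eq (3.6): foreground mask
        # Euclidean distance from every foreground pixel to the nearest background pixel.
        dist = ndimage.distance_transform_edt(fg)
        # Local distance peaks, searched within 8-connected foreground components.
        components = label(fg, connectivity=2)
        peaks = peak_local_max(dist, labels=components, footprint=np.ones((3, 3)))
        markers = np.zeros_like(components)
        markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
        regions = watershed(-dist, markers, mask=fg)     # candidate regions R_1..R_N
        # Eq (3.7): confidence = mean attention in each candidate box; Eq (3.8): argmax.
        best, best_score = (0, 0, H, W), -1.0            # fallback: whole image
        for r in regionprops(regions):
            y0, x0, y1, x1 = r.bbox
            score = A[y0:y1, x0:x1].mean()
            if score > best_score:
                best, best_score = (y0, x0, y1, x1), score
        y0, x0, y1, x1 = best
        return image[y0:y1, x0:x1], best                 # cropped fine-grained target, box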

    MGF-Net is responsible for multi-scale feature fusion and classification. It consists of a feature fusion layer and a classifier. The feature fusion layer uses the concatenation operation to achieve multi-scale feature fusion at the channel level. This is in contrast to other feature fusion methods (such as addition or multiplication), which do not ensure the diversity and completeness of features. Concatenation can handle feature maps of multiple scales simultaneously, which makes it more flexible. The MGF-Net classifier consists of one global average pooling layer and three fully connected layers. This enables multi-scale image classification. The structure of MGF-Net is depicted in Figure 4 and the network's layer parameters are detailed in Table 1.

    Figure 4.  Detailed structure of MGF-Net.
    Table 1.  MGF-Net parameters.
    Layers Resolution Description
    Input (7,7,1026),(7,7,512) -
    Concat (7,7,1026+512) Concat operation
    Global Pool (7,7,1538) Global average pooling
    FC - 4096-dimensional FC layer
    FC - 4096-dimensional FC layer
    FC - k-dimensional FC layer


    The feature map with 1026 channels is derived from FG-Net, while the feature map with 512 channels originates from CG-Net. The 1026-channel map is obtained after the input image passes through the four convolutional layers and one transformer layer of FG-Net, and the 512-channel map is obtained after the image processed by the MOSBC module passes through the 11 convolutional layers of CG-Net, as shown in Figure 3.
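    The fusion head in Table 1 can be sketched as follows in PyTorch. The channel counts (1026 and 512) and layer sizes follow the table, while num_classes and the majority-vote helper over the three branch classifiers are illustrative assumptions about details not fully specified here.

    import torch
    import torch.nn as nn

    class MGFNetSketch(nn.Module):
        """Sketch of the MGF-Net head in Table 1 (num_classes is an assumption)."""
        def __init__(self, num_classes: int = 3):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
            self.classifier = nn.Sequential(              # three fully connected layers
                nn.Linear(1026 + 512, 4096), nn.ReLU(inplace=True),
                nn.Linear(4096, 4096), nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )

        def forward(self, f_coarse: torch.Tensor, f_fine: torch.Tensor) -> torch.Tensor:
            # Channel-wise concatenation of the 1026-channel FG-Net map and the
            # 512-channel CG-Net map; both are assumed to share a 7x7 resolution.
            fused = torch.cat([f_coarse, f_fine], dim=1)  # (B, 1538, 7, 7)
            return self.classifier(self.pool(fused).flatten(1))

    def majority_vote(logits_fg, logits_cg, logits_mgf):
        # Final prediction by majority rule over the three branch classifiers.
        votes = torch.stack([l.argmax(dim=1) for l in (logits_fg, logits_cg, logits_mgf)])
        return votes.mode(dim=0).values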

    To evaluate the effectiveness of the proposed method, AGMG-Net was compared against state-of-the-art methods, and ablation experiments were conducted on three image recognition datasets: two publicly available datasets and one self-built dataset.

    To simulate real-world scenarios, the experiments were performed on the publicly available Flower and Butterfly datasets. The Flower dataset, obtained from http://download.tensorflow.org/example_images/flower_photos.tgz, comprises 4323 images of flowers from 5 categories with varying resolutions. The original butterfly dataset [41] consists of 200 categories, from which 20 categories were selected to create a smaller dataset named Butterfly20. This subset contains 2066 images with varying resolutions. Sample images from both datasets are depicted in Figure 5, exhibiting diverse lighting angles, shooting angles, distances and backgrounds, resembling the conditions and environments encountered in security systems for cargoes. Utilizing these datasets in the experiments enables a more accurate reflection of practical scenarios, enhancing the experiments' reliability and generalization ability. Evaluating the performance of AGMG-Net under different conditions using these datasets allows for a more comprehensive assessment, thereby improving the algorithm's robustness.

    Figure 5.  Sample images from each dataset.

    In addition, we created a self-built cargo recognition dataset named "Cargo" specifically for recognizing cargoes in security systems. The dataset comprises 3 categories and 4715 images with random resolutions, organized in a structure identical to the aforementioned publicly available datasets. An overview of the Cargo dataset is provided in Table 2.

    Table 2.  Data specification.
    Item Values
    Modalities RGB
    Total number of images 4715
    Number of classes 3
    Number of angle classes 6
    Number of distance classes 5
    Number of background classes 7


    The hardware environment used for this experiment was an Intel® Xeon® Platinum 8255C CPU @ 2.50 GHz with 12 CPU cores, 32 GB DDR4 memory and an NVIDIA GeForce RTX 3090 graphics card. The software platform was the Ubuntu 20.04 LTS operating system, Python 3.7, PyTorch 1.12.1, CUDA 11.3 and cuDNN 8.2.1.

    The hyperparameters were set as follows: 100 training iterations were performed for the Flower dataset, 200 for the Butterfly20 dataset and 120 for the Cargo dataset. The Adam optimizer was utilized with a batch size of 16 and a learning rate of 0.001. The training set and testing set were randomly split in a ratio of 8:2, ensuring an equal number of samples for each class.
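    A minimal sketch of this training setup is given below; the dataset and model objects are placeholders, and the plain random 8:2 split shown here would need to be replaced by a per-class (stratified) split to guarantee an equal number of samples for each class.

    import torch
    from torch.utils.data import DataLoader, random_split

    def make_loaders(dataset, batch_size=16, seed=0):
        # Random 8:2 train/test split (a stratified split per class would match the
        # balanced-sampling condition described above more closely).
        n_train = int(0.8 * len(dataset))
        train_set, test_set = random_split(
            dataset, [n_train, len(dataset) - n_train],
            generator=torch.Generator().manual_seed(seed))
        return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
                DataLoader(test_set, batch_size=batch_size))

    def make_optimizer(model):
        # Adam optimizer with the stated learning rate of 0.001.
        return torch.optim.Adam(model.parameters(), lr=1e-3)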

    To objectively evaluate the classification performance of different models, two commonly used metrics in classification tasks, namely accuracy and F1-score, were adopted. The F1-score is the harmonic mean of precision and recall, as expressed mathematically in Eq (4.1).

    $F1\text{-}score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$ (4.1)

    where $Precision = TP/(TP+FP)$ and $Recall = TP/(TP+FN)$; $TP$ represents the number of positive samples correctly classified as positive, $TN$ the number of negative samples correctly classified as negative, $FP$ the number of negative samples incorrectly classified as positive and $FN$ the number of positive samples incorrectly classified as negative.
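    For reference, the following Python snippet computes the metrics from raw confusion-matrix counts exactly as defined above; how the per-class values are averaged for the multi-class tables (e.g., macro-averaging) is not specified here and is therefore left out.

    def precision_recall_f1(tp: int, fp: int, fn: int):
        # Precision, recall and F1-score from confusion-matrix counts, per Eq (4.1).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: 50 true positives, 5 false positives, 3 false negatives.
    print(precision_recall_f1(50, 5, 3))   # approximately (0.909, 0.943, 0.926)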

    In this section, we first conducted experiments on the cropping threshold $\gamma$ used in CG-Net, while setting the length of the attention sequence $M_L$ to 4. The performance of AGMG-Net on the Cargo dataset is presented in Table 3.

    Table 3.  Accuracy of different clipping thresholds in Cargo dataset.
    Threshold γ Accuracy (%) F1-score (%)
    0.1 98.81 94.26
    0.2 99.12 95.33
    0.4 99.22 96.14
    0.6 99.22 96.09
    0.8 99.01 95.51


    As shown in Table 3, AGMG-Net achieves its best accuracy when the cropping threshold $\gamma$ is between 0.2 and 0.6, its worst accuracy when $\gamma$ is 0.1, and a noticeable drop in accuracy when $\gamma$ is 0.8. This shows that $\gamma$ in the range of 0.2 to 0.6 helps the model crop the target properly and effectively improves its recognition performance. When $\gamma$ is set to 0.4, the F1-score of AGMG-Net reaches its maximum value of 96.14%. Therefore, $\gamma$ was set to 0.4 in all subsequent experiments.

    Furthermore, we conducted additional experiments on the length of the attention sequence $M_L$ with $\gamma$ set to 0.4. The performance of AGMG-Net on the Cargo dataset is presented in Table 4.

    Table 4.  Accuracy of different attention sequence lengths in Cargo dataset.
    ML Accuracy (%) F1-score (%)
    2 98.28 96.02
    4 99.22 96.14
    6 99.43 97.21
    8 99.58 97.35
    10 99.36 97.13


    Table 4 shows that AGMG-Net achieves the highest accuracy when the length of $M_L$ is between 6 and 8. The accuracy is lower when the length is 2 or 4, and it decreases again when the length is set to 10. These findings suggest that retaining between 6 and 8 adjacent attention maps helps mitigate the influence of outliers during training and improves recognition performance. The model achieves its highest recognition performance when the length of $M_L$ is set to 8; therefore, we set it to 8 in subsequent experiments.

    The ablation experiment was conducted to investigate the impact of the AMAA and MOSBC modules in AGMG-Net on network performance. We removed the AMAA and MOSBC modules from FG-Net and CG-Net and conducted ablation experiments on the three datasets mentioned in Section 4.1. The experimental results, depicted in Figure 6, show that the AMAA and MOSBC modules have a significant impact on the recognition accuracy of the AGMG-Net model.

    Figure 6.  Comparison of adding and removing the AMAA and MOSBC modules on the three datasets. "Primitive" denotes the model with both modules removed and "Primitive ++" denotes the model with both AMAA and MOSBC in effect.

    It is evident from Figure 6(c) that the addition of the AMAA and MOSBC modules increases the number of parameters that AGMG-Net needs to learn. In the early training process, only FG-Net is trained, resulting in relatively low accuracy. However, once the test accuracy reaches a threshold of 0.95, CG-Net starts participating in the training process, leading to rapid object localization and multi-scale feature fusion by round 63. This can be observed in Figure 6(a) and (b), where AGMG-Net surpasses the performance of the other models on the Flower (threshold: 0.8), Butterfly20 (threshold: 0.8) and Cargo datasets, with increasing margins from rounds 43, 81 and 63, respectively. The optimal values are achieved at rounds 97, 105 and 143, with lead values of 3.7, 5.48 and 0.29%. These experiments demonstrate that the AMAA and MOSBC modules positively impact the recognition accuracy of AGMG-Net in image recognition.

    Table 5 quantifies the ablation results shown in Figure 6. Using only the AMAA module yields higher accuracy than Primitive, whereas using only the MOSBC module performs worse than Primitive. This indicates that relying solely on the raw CAM as a basis for localization does not provide additional features and can even be misled by anomalous CAMs. When AMAA and MOSBC work together, the model achieves the highest accuracy and the lowest loss. This indicates that, once the target has been effectively localized, the MOSBC module can perform proper target acquisition and provide more complete target information for AGMG-Net.

    Table 5.  Impact of AMAA and MOSBC modules on three datasets.
    Methods Flower Butterfly20 Cargo
    Accuracy(%) Loss Accuracy(%) Loss Accuracy(%) Loss
    Primitive 89.03 0.7017 83.10 0.8899 99.29 0.0698
    Primitive + AMAA 89.27 0.4241 85.95 0.7994 99.47 0.0762
    Primitive + MOSBC 86.72 0.6617 81.01 1.4036 98.79 0.1127
    Primitive ++ 92.73 0.3697 88.57 0.3751 99.58 0.0211


    To evaluate the effectiveness of the proposed model, this section presents comparative experiments on the Flower, Butterfly20 and Cargo datasets. AGMG-Net is compared with a representative traditional convolutional network (VGG [42]), a residual network (ResNeSt [43]), a vision transformer (ViT [44]) and CoAtNet [45], a model that integrates the advantages of both convolution and transformers. The experiments maintain consistent training parameters, including batch size, learning rate, number of iterations and weight decay. The results are presented in Tables 6–8.

    Table 6.  Comparison of Classification models in Flower dataset. "Train" stands for training speed and "Test" stands for testing speed in Seconds Per Image (SPI).
    Methods Accuracy (%) F1-score (%) Loss Train (SPI) Test (SPI)
    VGG 74.13 37.84 0.8189 0.0066 0.0020
    ResNeSt 86.49 44.01 0.6651 0.0075 0.0020
    ViT 69.86 34.08 0.8850 0.0066 0.0025
    CoAtNet 88.50 54.11 0.3550 0.0103 0.0031
    DeepMAD (SOTA) 90.57 61.33 0.3368 0.0226 0.0068
    AGMG-Net 92.73 63.71 0.3697 0.0212 0.0127

    Table 7.  Comparison of Classification models in Butterfly20 dataset.
    Methods Accuracy (%) F1-score (%) Loss Train (SPI) Test (SPI)
    VGG 62.62 34.11 1.1135 0.0066 0.0020
    ResNeSt 77.62 41.45 0.8012 0.0076 0.0021
    ViT 50.00 15.52 1.7806 0.0066 0.0026
    CoAtNet 85.00 52.60 0.6337 0.0101 0.0031
    DeepMAD (SOTA) 88.41 71.62 0.4246 0.0232 0.0081
    AGMG-Net 88.57 73.45 0.3751 0.0217 0.0131

    Table 8.  Comparison of Classification models in Cargo dataset.
    Methods Accuracy (%) F1-score (%) Loss Train (SPI) Test (SPI)
    VGG 99.15 94.69 0.0421 0.0067 0.0019
    ResNeSt 99.21 95.97 0.0332 0.0075 0.0020
    ViT 92.48 69.16 0.2025 0.0066 0.0025
    CoAtNet 99.23 96.02 0.0244 0.0103 0.0032
    DeepMAD (SOTA) 99.53 97.04 0.0256 0.0225 0.0065
    AGMG-Net 99.58 97.35 0.0211 0.0194 0.0111


    VGG represents traditional convolutional networks and performs excellently in image classification tasks due to its regular network structure, which comprises convolutional and pooling layers. This structure enables VGG to quickly capture image features within a limited number of training iterations. ResNeSt represents residual networks, which incorporate residual blocks that facilitate easier model training and enable capturing deep features of cargoes in complex environments. ViT is a transformer-based visual model that partitions the image into small blocks and processes them through multi-head self-attention mechanisms. This allows ViT to capture global information and the relationships between image blocks, resulting in high recognition accuracy on the ImageNet dataset. CoAtNet combines deep convolutional networks with self-attention mechanisms, enabling it to efficiently extract target features and perform well on both small-scale and large-scale datasets.

    The results of ViT in Table 7 show that visual transformers excel at learning global and coarse-grained features, but they struggle to extract effective features from a limited number of samples. On the other hand, convolutional models like VGG and ResNeSt perform relatively poorly at learning fine-grained features of the target, but they can still achieve decent results on small-scale datasets. CoAtNet combines the strengths of both approaches, exhibits powerful feature extraction capabilities and achieves an impressive accuracy of 85%. AGMG-Net, with its CG-Net subnetwork that extracts finer-grained features and combines multiscale features, surpasses CoAtNet in accuracy, demonstrating the effectiveness of combining coarse and fine-grained features.

    The information presented in Tables 6 and 8 shows that the distinctive characteristics of the ViT model are effectively utilized on the medium-scale Flower and Cargo datasets, facilitating the extraction of features at a global scale. Furthermore, the VGG and ResNeSt models demonstrate improved learning capabilities in capturing intricate details within the datasets. However, it is noteworthy that the CoAtNet model surpasses both VGG and ResNeSt in feature extraction ability, achieving notably high accuracy rates of 88.5 and 99.23%, as well as F1-scores of 52.6 and 96.02%, respectively. Nevertheless, the proposed AGMG-Net model, which incorporates more comprehensive coarse-grained features and richer fine-grained features, exhibits a further performance improvement over the CoAtNet model, yielding accuracy gains of 4.23 and 0.35%. Comparing the running speeds of the models in Tables 6–8, it can be seen that AGMG-Net is more time-consuming in the inference process but faster than the state-of-the-art (SOTA) model in the training process. The multi-branch AGMG-Net achieves 99.58% accuracy on the Cargo dataset at only a slightly longer runtime, which shows that the proposed model is better suited to security systems that require high recognition accuracy rather than high recognition speed.

    The above experiments demonstrate that AGMG-Net can effectively leverage the multi-granularity features of the data, resulting in higher classification accuracy and precision. Additionally, AGMG-Net has the potential to accurately classify cargo in complex environmental conditions, demonstrating its practicality and usability for cargo recognition tasks.

    The AGMG-Net proposed in this paper has significant advantages in cargo recognition. Unlike existing methods, AGMG-Net specifically considers the fine-grained features of cargo and leverages multiscale features to enhance recognition accuracy, even when data is limited and there are minimal differences between cargo classes. AGMG-Net incorporates the coarse-grained branch's AMAA module for target localization and the fine-grained branch's MOSBC module for target cropping. It then combines the feature maps of both branches through a multiscale fusion branch in Concat mode. Furthermore, it improves prediction accuracy by employing the majority voting method in the prediction layer.

    The average recognition rates on the self-built Cargo dataset, as well as the public datasets Flower and Butterfly20, are 99.58, 92.73 and 88.57%. Experimental results demonstrate that AGMG-Net outperforms VGG, ResNeSt, ViT and CoAtNet in terms of classification effectiveness.

    In conclusion, AGMG-Net is an effective cargo recognition model that enhances recognition accuracy and classification effectiveness through the integration of attention-guided multiscale feature fusion. It can be successfully applied to cargo recognition tasks within complex security system environments, thereby significantly contributing to cargo safety.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    The authors declare there is no conflict of interest.



    [1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16x16 words: Transformers for image recognition at scale, preprint, arXiv: 2010.11929.
    [2] Y. Liu, S. Albanie, A. Nagrani, A. Zisserman, Use what you have: Video retrieval using representations from collaborative experts, preprint, arXiv: 1907.13487.
    [3] X. Wang, J. Wu, J. Chen, L. Li, Y. F. Wang, W. Y. Wang, Vatex: A large-scale, high-quality multilingual dataset for video-and-language research, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), 4581–4591. https://doi.org/10.1109/ICCV.2019.00468
    [4] L. Zhu, Y. Yang, Actbert: Learning global-local video-text representations, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 8746–8755. https://doi.org/10.1109/CVPR42600.2020.00877
    [5] F. C. Heilbron, V. Escorcia, B. Ghanem, J. C. Niebles, Activitynet: A large-scale video benchmark for human activity understanding, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), 961–970. https://doi.org/10.1109/CVPR.2015.7298698
    [6] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, et al., Univl: A unified video and language pre-training model for multimodal understanding and generation, preprint, arXiv: 2002.06353.
    [7] N. Shvetsova, B. Chen, A. Rouditchenko, S. Thomas, B. Kingsbury, R. S. Feris, et al., Everything at once-multi-modal fusion transformer for video retrieval, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 20020–20029. https://doi.org/10.1109/CVPR52688.2022.01939
    [8] N. C. Mithun, J. Li, F. Metze, A. K. Roy-Chowdhury, Learning joint embedding with multimodal cues for cross-modal video-text retrieval, in Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, (2018), 19–27. https://doi.org/10.1145/3206025.3206064
    [9] J. Xu, T. Mei, T. Yao, Y. Rui, Msr-vtt: A large video description dataset for bridging video and language, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), 5288–5296. https://doi.org/10.1109/CVPR.2016.571
    [10] Y. Yuan, T. Mei, W. Zhu, To find where you talk: Temporal sentence localization in video with attention based location regression, in Proceedings of the AAAI Conference on Artificial Intelligence, 33 (2019), 9159–9166. https://doi.org/10.1609/aaai.v33i01.33019159
    [11] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, et al., Dual encoding for zero-example video retrieval, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), 9346–9355. https://doi.org/10.1109/CVPR.2019.00957
    [12] M. Bain, A. Nagrani, G. Varol, A. Zisserman, Frozen in time: A joint video and image encoder for end-to-end retrieval, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 1728–1738. https://doi.org/10.1109/ICCV48922.2021.00175
    [13] A. Miech, D. Zhukov, J. B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic, Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), 2630–2640. https://doi.org/10.1109/ICCV.2019.00272
    [14] S. Wang, R. Wang, Z. Yao, S. Shan, X. Chen, Cross-modal scene graph matching for relationship-aware image-text retrieval, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2020), 1508–1517. https://doi.org/10.1109/WACV45572.2020.9093614
    [15] F. Shang, C. Ran, An entity recognition model based on deep learning fusion of text feature, Inf. Process. Manage., 59 (2022), 102841. https://doi.org/10.1016/j.ipm.2021.102841 doi: 10.1016/j.ipm.2021.102841
    [16] Y. Zhou, R. Zhang, C. Chen, C. Li, C. Tensmeyer, T. Yu, et al., Towards language-free training for text-to-image generation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 17907–17917. https://doi.org/10.1109/CVPR52688.2022.01738
    [17] W. Li, S. Wen, K. Shi, Y. Yang, T. Huang, Neural architecture search with a lightweight transformer for text-to-image synthesis, IEEE Trans. Network Sci. Eng., 9 (2022), 1567–1576. https://doi.org/10.1109/TNSE.2022.3147787 doi: 10.1109/TNSE.2022.3147787
    [18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, preprint, arXiv: 1301.3781.
    [19] F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in 2007 IEEE Conference on Computer Vision and Pattern Recognition, (2007), 1–8. https://doi.org/10.1109/CVPR.2007.383266
    [20] B. Klein, G. Lev, G. Sadeh, L. Wolf, Associating neural word embeddings with deep image representations using Fisher vectors, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), 4437–4446. https://doi.org/10.1109/CVPR.2015.7299073
    [21] R. Kiros, R. Salakhutdinov, R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, preprint, arXiv: 1411.2539.
    [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, et al., Learning transferable visual models from natural language supervision, preprint, arXiv: 2103.00020.
    [23] I. Croitoru, S. V. Bogolin, M. Leordeanu, H. Jin, A. Zisserman, S. Albanie, et al., Teachtext: Crossmodal generalized distillation for text-video retrieval, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 11583–11593. https://doi.org/10.1109/ICCV48922.2021.01138
    [24] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, Vivit: A video vision transformer, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 6836–6846. https://doi.org/10.1109/ICCV48922.2021.00676
    [25] G. Bertasius, H. Wang, L. Torresani, Is space-time attention all you need for video understanding?, ICML, 2 (2021), 4. https://doi.org/10.48550/arXiv.2102.05095 doi: 10.48550/arXiv.2102.05095
    [26] A. Miech, J. B. Alayrac, L. Smaira, I. Laptev, J. Sivic, A. Zisserman, End-to-end learning of visual representations from uncurated instructional videos, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 9879–9889. https://doi.org/10.1109/CVPR42600.2020.00990.
    [27] X. Duan, W. Huang, C. Gan, J. Wang, W. Zhu, J. Huang, Weakly supervised dense event captioning in videos, Adv. Neural Inf. Process. Syst., 31 (2018). https://doi.org/10.48550/arXiv.1812.03849
    [28] R. Tan, H. Xu, K. Saenko, B. A. Plummer, Logan: Latent graph co-attention network for weakly-supervised video moment retrieval, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2021), 2083–2092. https://doi.org/10.1109/WACV48630.2021.00213
    [29] V. Gabeur, C. Sun, K. Alahari, C. Schmid, Multi-modal transformer for video retrieval, in Computer Vision–ECCV 2020: 16th European Conference, (2020), 214–229. https://doi.org/10.48550/arXiv.2007.10639
    [30] M. Dzabraev, M. Kalashnikov, S. Komkov, A. Petiushko, Mdmmt: Multidomain multimodal transformer for video retrieval, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), 3354–3363. https://doi.org/10.1109/CVPRW53098.2021.00374
    [31] L. Li, Z. Shu, Z. Yu, X. J. Wu, Robust online hashing with label semantic enhancement for cross-modal retrieval, Pattern Recognit., 145 (2023), 109972. https://doi.org/10.1109/ICME55011.2023.00173 doi: 10.1109/ICME55011.2023.00173
    [32] Z. Shu, K. Yong, J. Yu, S. Gao, C. Mao, Z. Yu, Discrete asymmetric zero-shot hashing with application to cross-modal retrieval, Neurocomputing, 511 (2022), 366–379. https://doi.org/10.1016/j.neucom.2022.09.037 doi: 10.1016/j.neucom.2022.09.037
    [33] Z. Shu, Y. Bai, D. Zhang, J. Yu, Z. Yu, X. J. Wu, Specific class center guided deep hashing for cross-modal retrieval, Inf. Sci., 609 (2022), 304–318. https://doi.org/10.1016/j.ins.2022.07.095 doi: 10.1016/j.ins.2022.07.095
    [34] M. Su, G. Gu, X. Ren, H. Fu, Y. Zhao, Semi-supervised knowledge distillation for cross-modal hashing, IEEE Trans. Multimedia, 25 (2021), 662–675. https://doi.org/10.1109/TMM.2021.3129623 doi: 10.1109/TMM.2021.3129623
    [35] J. A. Portillo-Quintero, J. C. Ortiz-Bayliss, H. Terashima-Marín, A straightforward framework for video retrieval using clip, in Mexican Conference on Pattern Recognition, (2021), 3–12. https://doi.org/10.1007/978-3-030-77004-4_1
    [36] M. Patrick, P. Y. Huang, Y. Asano, F. Metze, A. Hauptmann, J. Henriques, et al., Support-set bottlenecks for video-text representation learning, preprint, arXiv: 2010.02824.
    [37] Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, et al., Camp: Cross-modal adaptive message passing for text-image retrieval, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), 5764–5773. https://doi.org/10.1109/ICCV.2019.00586
    [38] H. Diao, Y. Zhang, L. Ma, H. Lu, Similarity reasoning and filtration for image-text matching, in Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), 1218–1226. https://doi.org/10.1609/aaai.v35i2.16209
    [39] Y. Liu, H. Liu, H. Wang, F. Meng, M. Liu, BCAN: Bidirectional correct attention network for cross-modal retrieval, IEEE Trans. Neural Netw. Learn. Syst., (2023). https://doi.org/10.1109/TNNLS.2023.3276796 doi: 10.1109/TNNLS.2023.3276796
    [40] S. K. Gorti, N. Vouitsis, J. Ma, K. Golestan, M. Volkovs, A. Garg, et al., X-pool: Cross-modal language-video attention for text-video retrieval, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 5006–5015. https://doi.org/10.1109/CVPR52688.2022.00495
    [41] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, J. Han, Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 12655–12663. https://doi.org/10.1109/CVPR42600.2020.01267
    [42] F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, Vse++: Improving visual-semantic embeddings with hard negatives, preprint, arXiv: 1707.05612.
    [43] Y. Yu, J. Kim, G. Kim, A joint sequence fusion model for video question answering and retrieval, in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 471–487. https://doi.org/10.1007/978-3-030-01234-2_29
    [44] A. Rohrbach, M. Rohrbach, N. Tandon, B. Schiele, A dataset for movie description, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), 3202–3212. https://doi.org/10.1109/CVPR.2015.7298940
    [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res., 15 (2014), 1929–1958.
    [46] Z. Xie, I. Sato, M. Sugiyama, Stable weight decay regularization, preprint, arXiv: 2011.11152.
    [47] S. Liu, H. Fan, S. Qian, Y. Chen, W. Ding, Z. Wang, Hit: Hierarchical transformer with momentum contrast for video-text retrieval, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 11915–11925. https://doi.org/10.1109/ICCV48922.2021.01170
    [48] J. Wang, Y. Ge, R. Yan, Y. Ge, K. Q. Lin, S. Tsutsui, et al., All in one: Exploring unified video-language pre-training, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), 6598–6608. https://doi.org/10.1109/CVPR52729.2023.00638
    [49] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, et al., Less is more: Clipbert for video-and-language learning via sparse sampling, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021), 7331–7341. https://doi.org/10.1109/CVPR46437.2021.00725
    [50] H. Fang, P. Xiong, L. Xu, Y. Chen, Clip2video: Mastering video-text retrieval via image clip, preprint, arXiv: 2106.11097.
    [51] J. Lei, T. L. Berg, M. Bansal, Revealing single frame bias for video-and-language learning, preprint, arXiv: 2011.11152.
    [52] L. Li, Z. Gan, K. Lin, C. C. Lin, Z. Liu, C. Liu, et al., Lavender: Unifying video-language understanding as masked language modeling, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), 23119–23129. https://doi.org/10.1109/CVPR52729.2023.02214
    [53] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, et al., Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning, Neurocomputing, 508 (2022), 293–304. https://doi.org/10.1016/j.neucom.2022.07.028 doi: 10.1016/j.neucom.2022.07.028
    [54] F. Cheng, X. Wang, J. Lei, D. Crandall, M. Bansal, G. Bertasius, Vindlu: A recipe for effective video-and-language pretraining, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), 10739–10750. https://doi.org/10.1109/CVPR52729.2023.01034
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)