
In this paper, some new findings on the uniqueness and existence of positive periodic solutions to first-order functional differential equations are presented. These equations have wide applications in a variety of fields. The most important feature of our argument is that we use the theory of Hilbert's metric to prove the uniqueness of the positive periodic solution when q=−1 and −1<q<0. In addition, we also investigate the existence results of positive periodic solutions by applying a fixed point theorem for completely continuous maps in a cone. Two examples demonstrate our findings.
Citation: Jiaqi Xu, Chunyan Xue. Uniqueness and existence of positive periodic solutions of functional differential equations[J]. AIMS Mathematics, 2023, 8(1): 676-690. doi: 10.3934/math.2023032
Object recognition aims at detecting objects and predicting the class label of a given image, and it has been widely used in classification [1,2], localization [3,4,5,6,7], segmentation [8,9], retrieval [10,11,12] and natural language processing [13,14]. Significant advances have been reported in a large body of deep learning literature [15,16,17,18,19]. Despite this success, most of the proposed methods are based on supervised learning, which is driven by the availability of manually annotated instances with powerful low-level visual features [7]. However, the frequencies of objects in the wild follow a long-tailed distribution that consists of a few common classes and many rare classes [20]. On the one hand, it is difficult to train a classifier effectively for rare classes that lack sufficient representative labeled instances. Moreover, collecting large-scale labeled instances is extremely challenging, even though adding more instances improves model performance. Taking the large-scale dataset ImageNet [21] as an example, it contains a total of 14M images in 21,841 classes, and it is unrealistic to exhaustively annotate hundreds of instances for each class. On the other hand, for certain classes labeled instances are precious and hard to obtain in significant numbers, e.g., endangered bird species in fine-grained datasets, whose images are hard to annotate without expert knowledge [22], let alone collect. In addition, new objects emerge over time that are not covered by known classes and have no labeled instances beforehand, e.g., high-quality radiology images of patients infected by COVID-19 were not available before 2019. As a result, conventional approaches cannot tackle the above problems.
There are increasing efforts to address the problem of insufficient or even absent labeled instances. One-shot and few-shot learning [23] deal with classes that have few labeled instances. Open world recognition performs three tasks: detecting the novelty of the test classes via open set recognition, which was initially proposed in [24]; progressively labeling instances of novel unseen classes by class-incremental learning; and adapting the model to classify the acquired labeled instances [25]. These techniques reduce the dependence on labeled instances and improve accuracy, but they still require at least some labeled instances for model learning. Consequently, they fail to determine the class labels of instances belonging to unseen classes that have no labeled data.
In contrast, humans are able to recognize unseen classes by intelligently utilizing previously learned knowledge extracted from seen classes. For example, a learner can easily recognize the Persian fallow deer if he/she has seen a fallow deer and is aware that the Persian fallow deer resembles it but has bigger antlers and white spots around the neck. In this way, humans are capable of distinguishing more than 30,000 objects [26] as well as many of their subordinate varieties. Inspired by the human ability to recognize new objects without seeing all classes in advance, Zero-Shot Learning (ZSL) [27,28,29,30,31] has drawn significant attention. It is proposed to recognize entirely novel classes that are absent from the training instances by extrapolating from the knowledge contained in the observed classes [32]. More specifically, given labeled training instances of seen classes in the source domain, ZSL aims to establish a model that classifies the instances of unseen classes in the target domain, which substantially reduces the labor and time costs of annotation. In addition to image-related computer vision, applications of ZSL have been emerging in various fields, such as zero-shot translation [33], bilingual dictionary induction [34] and molecular compound analysis [35].
In the absence of labeled instances of unseen classes, the key idea underpinning ZSL methods is to exploit knowledge that transfers via shared auxiliary information. Seen classes are associated with unseen classes in a common space, i.e., the semantic space, and the high-level semantic representations serve as auxiliary information shared among these classes. They thereby act as a bridge that makes ZSL feasible. Generally, there are multiple types of semantic information: attributes [36], word vectors [37,38,39], textual descriptions [40], hierarchical ontology [41,42], etc. The most commonly used semantic space nowadays is the attribute space [27]. Each class is endowed with a unique semantic prototype [43] in this space. The prototype is specified by a binary or continuous attribute vector that indicates the class properties manually designed by experts. The relatedness of the classes is represented by the similarity of the semantic prototypes, e.g., the semantic prototype of zebra is closer to that of horse than to that of pig, which agrees with the fact that zebra is semantically related to horse. Therefore, ZSL can learn a model properly with the aid of semantic representations. Most existing ZSL approaches [27,40,41,44,45,46] exploit a visual-semantic projection to reflect the relationship among the classes. Specifically, during training the projection is learned to map the low-level visual features of the labeled instances, which come from seen classes only, to the semantic space. At the test stage, the learned projection function is applied to map the target instances of unseen classes to the same semantic embedding space where seen and unseen classes reside. Then, the similarities between the predicted semantic representations and the prototypes are measured by a certain metric. Using nearest neighbor (NN) search, the target instance is assigned to the unseen class whose semantic prototype yields the highest score.
Despite the success of those semantic embedding models, the largest challenge in ZSL is the projection domain shift problem [43] between the disjoint seen and unseen classes, which manifests itself in the following aspects. On the one hand, the visual feature space is independent of the semantic space, and the two have distinct distributions. Hence, it is very difficult to learn an effective and compatible projection function between the two spaces. On the other hand, the visual appearance of the same attribute in seen and unseen classes can be fairly different. The discrepancy is analyzed empirically in [43]: the shared attribute "has tail" in the target unseen class Pig is visually different from that in the source seen class Zebra. Thus, there are significant differences in the underlying distributions of the classes, which lead to poor performance on novel classes. In other words, if the projection functions learned with the training instances of seen classes are directly applied to the unseen classes without adaptation, the target instance tends to be shifted far away from the corresponding class prototype, resulting in unsatisfactory recognition by NN search at the test stage.
There has been a recent surge of interest in building a projection function that generalizes better to novel classes and is less susceptible to domain shift. Firstly, a large body of literature in the inductive setting has been published to overcome this problem [27,32,47,48,49]. The most representative method is SAE [48], which learns a linear projection function from the visual feature space to the semantic space based on the auto-encoder paradigm; the decoder is the transpose of the encoder and is subject to a reconstruction constraint on the original visual features. However, because inductive methods only have access to the seen data, the projection is likely to capture the characteristics of the seen classes rather than the unseen ones, which hinders effective generalization. Secondly, generative models are proposed to compensate for the missing visual features of target unseen data. The two prominent families are Generative Adversarial Networks (GANs) [50,51,52,53,54] and Variational Auto-Encoders (VAEs) [55], which synthesize visual features by utilizing the semantic prototypes of unseen classes. However, the choice of the semantic prototype is essential, as low-quality prototypes may degrade the effectiveness of the generator. Thirdly, various methods resort to transductive learning [43,56,57,58,59,60,61,62], which leverages the unlabeled target instances during training. The existing transductive learning methods are classified into three categories. The first one is label propagation. For example, Fu et al. [43] combine multiple semantic representations with visual features of unseen classes to learn a joint embedding space, in which the target data are aligned with the label embeddings and recognition is then performed via label propagation. The second one is self-training, which progressively improves the classification capacity in an iterative refining process [59].
The last one, termed domain adaptation, is the most relevant to our model and has been well investigated to uncover the common knowledge of the source and target domains [63,64,65]. Different from the first two strategies, [65] simultaneously utilizes the visual features and semantic prototypes of unseen classes, whereas our model only utilizes the visual features of the unlabeled instances. Furthermore, our model is not concerned with aligning the distributions of the projected and original domains [63] or those of the features in the intermediate space [64]. Instead, the latent space in our model is semantically meaningful and encourages learning generalizable semantics that retain sufficient information about the visual features through a reconstruction task.
In this paper, we develop a novel model that exploits the auto-encoder framework to solve zero-shot challenges and reveal the relationship between visual features and semantic representations. We assume that the majority of the semantic properties of the unseen classes are shared with those of the seen classes. Following previous work, we adopt the semantic space as the latent embedding space to preserve the semantic relatedness between the classes. Motivated by [48,63], our model takes advantage of the bi-shifting linear auto-encoder framework. Specifically, the common encoder shared by the source and target domains learns the projection from the visual feature space to the semantic space. Considering the distribution divergence of the disjoint domains, the original features are reconstructed by two different decoders based on the learned semantic representations. There are two regularization terms in our model. Inspired by [48], the first term is designed to inherit the properties of the semantic space by incorporating the semantic prototypes of the seen classes, so that the projections are constrained to force the learned semantics of unlabeled target instances as close as possible to their class prototypes. Consequently, the semantic mismatch between visual features and semantic representations can be reduced. The second regularization term enforces that the decoder in the target domain is derived from, rather than identical to, the decoder learned under supervision from the semantics of instances in the source domain, so that the visual features are reconstructed faithfully. To that end, we design a novel Bi-shifting Semantic Auto-Encoder (hereafter referred to as BSAE) architecture that integrates the merits of both domain adaptation and the discriminative ability of class semantics, as shown in Figure 1. Although BSAE is based on a linear auto-encoder and is quite shallow, the experimental results reported in Section 4 demonstrate its strong performance.
For example, the average accuracy on five benchmark datasets under different protocols improves by 6.9% over the current state of the art. In summary, the main contributions are three-fold:
● A simple and effective ZSL model termed BSAE is developed in an auto-encoder framework. Our model not only alleviates the domain shift problem, but also recovers the interaction between visual feature and semantic representation.
● We consider ZSL as the problem of learning the projection functions to explore shared discriminative semantic representations of instances, which are supervised by the semantic prototypes of the seen instances. Meanwhile, the generalization capability of the learned semantics is enhanced by exploiting the visual features of unlabeled instances.
● An iterative algorithm with high computational efficiency is introduced to solve the problem. Extensive experimental results demonstrate that our approach achieves superior performance on five benchmark datasets, even if the class prototypes of the unseen classes are not available.
The remainder of the paper is organized as follows. In Section 2, we briefly review the related work proposed to overcome the challenges in ZSL. In Section 3, we describe the proposed model and derive an efficient iterative algorithm. The results are reported and discussed in Section 4. Finally, in Section 5, we present the conclusion and propose several research directions for future investigation.
In this section, we first review the semantic spaces exploited in current zero-shot learning. Then, we briefly review the projection learning methods related to our work. Finally, we present the related advances that relieve the domain shift problem.
Semantic representations shared between classes bridge the gap in ZSL and enable the transfer of common knowledge from seen to unseen classes. There are various semantic spaces formed by different class embeddings. The attribute space [27] is the most popular and effective one [46,66,67]; in it, the properties of the classes are described as attributes. However, manually collecting and annotating attributes relies heavily on the efforts of experts. Semantic spaces based on word vectors [38,39] and text descriptions [40] are proposed because they are relatively less labor intensive: the semantic representations are automatically extracted by embedding models from a text corpus (e.g., Wikipedia). However, it is inconvenient for humans to incorporate class knowledge into such semantics; for example, as reported in [40], 10 sentence descriptions are collected for each image to construct the semantic space, which is even more expensive than annotating attributes. Moreover, SJE [41] and ESZSL [47] have shown that the attribute space is more effective than the word vector space. Besides, several ZSL methods combine the aforementioned semantic spaces [50,60,68]. In our work, we use the attribute space as the semantic space.
The existing ZSL models can be sub-categorized into three groups, depending on how the projection function is established.
The first group learns a forward projection from the visual feature space to the semantic space. Lampert et al. [27] proposed two attribute-based classifiers, the direct attribute predictor (DAP) and the indirect attribute predictor (IAP), which exploit the attributes to predict the class labels of instances in a two-stage scheme. SOC [32] first projects the visual features into the semantic space, and then determines the class label through KNN. CONSE [37] exploits a probabilistic model and then predicts the unseen classes via a convex combination of the class-embedding vectors. DeViSE [38] applies a linear correspondence function by combining similarity and hinge ranking loss. ESZSL [47] learns a bilinear compatibility function by optimizing a square loss. To optimize the ranking loss, ALE [58] employs a bilinear compatibility function.
The second group learns a reverse projection from the semantic space to the visual space to rectify the hubness problem [69]. Hubness refers to the phenomenon in which a few semantic prototypes become the nearest neighbors of many instances from different classes; it is an aspect of the curse of dimensionality. Zhang et al. [70] proposed a deep end-to-end neural network that embeds the class prototypes into the visual feature space, which suffers much less from the hubness problem, as discussed in [71]. In addition to the embedding models, generative methods have recently been proposed to generate instances for unseen classes by leveraging the semantic prototypes, so that the ZSL problem is converted into a conventional supervised problem. f-CLSWGAN [50] and LisGAN [68] condition the generator on semantics to synthesize the visual features. However, generative models are hard to train because of the min-max optimization. The auto-encoder is an effective framework for extracting representative features in an unsupervised manner and alleviating the domain shift problem. Xu et al. [52] construct the visual feature space as the latent layer and learn two different regressors for semantic reconstruction. The latent layer of [48] is the semantic space, and the linear projection between the visual feature space and the semantic space is learned with the semantic constraint of seen classes to reconstruct the original data. [72] improves the model in [48] by adding a regularization constraint on the projection function, thereby ensuring that the structural risk of the model is minimized.
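The hubness phenomenon described above can be made concrete by counting how often each prototype is selected as a nearest neighbor. The following is a minimal sketch on a hand-made toy example; the helper name `nearest_prototype_counts` and the Euclidean distance are assumptions chosen for illustration and are not taken from the cited works.

```python
import numpy as np


def nearest_prototype_counts(points, prototypes):
    """Count how many points select each prototype as their nearest neighbor.

    points:     (n, k) array, one projected instance per row
    prototypes: (c, k) array, one class prototype per row
    """
    # Pairwise squared Euclidean distances, shape (n, c).
    diff = points[:, None, :] - prototypes[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(dist2, axis=1)
    return np.bincount(nearest, minlength=prototypes.shape[0])


# Toy example: two instances sit next to prototype 0, one next to prototype 1.
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
toy_prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_counts = nearest_prototype_counts(toy_points, toy_prototypes)
```

A strongly skewed count vector, in which a few prototypes collect most of the instances regardless of their true classes, is the signature of hubness.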
In the last group, both visual features and semantic representations are projected into a common space. SYNC [49] learns classifiers of unseen classes by linearly combining base classifiers. Zhang and Saligrama [67] leverage similar class relationships in the common space, which is defined by the seen classes proportions.
Taking full advantage of the first two groups, our model is close to [63], in which a bi-shifting auto-encoder is employed to reconstruct visual features in different domains. Different from [63], which applies nonlinear projections to learn the representations in the latent space, our model constrains the latent space to be the semantic space with class semantic prototypes and exploits linear projection functions to fit the distributions of the visual and semantic spaces, respectively.
The domain shift problem was first reported in [43] and remains an open issue in ZSL. It describes the fact that the projection functions learned from the seen classes are biased when they are used to map instances of unseen classes from the visual feature space to the semantic space. It is essentially caused by the disjoint seen and unseen classes having different underlying data distributions. Researchers have investigated how to rectify the domain shift problem and have obtained competitive results. For instance, SAE [48] imposes an additional reconstruction constraint on the training seen data, which makes the learned projection function more generalizable across seen and unseen classes. LisGAN [68] reduces the domain shift by generating soul instances related to the semantic representations. However, as the unseen class data are not involved in model learning, the generalization ability of the inductive methods is limited. Transductive ZSL is an emerging approach to mitigating the domain shift problem, in which not only labeled seen class data but also unlabeled unseen class data are available, which potentially leads to improvements in classification performance. Fu et al. [43] first propose a transductive multi-view embedding framework, and then generate the class labels for unseen class data via label propagation. Kodirov et al. [65] formulate a regularized sparse coding framework to solve the domain shift problem. A measure of inter-class semantic consistency is proposed by [73] to explore the relation between the semantic manifold and the visual-semantic projection on seen classes. VCL [56] proposes a visual structure constraint on class centers. Unlike SAE [48], which exploits one decoder to reconstruct the features without domain adaptation, our model employs the transductive setting and adopts two decoders to reconstruct the visual features in the source and target domains.
Additionally, we impose a similarity constraint between the two decoders as a regularizer, which controls the amount of adaptation from the decoder learned on labeled seen class data rather than letting the target decoder deviate freely. Although we only use the visual features of unseen data, rather than the combination of the visual features and semantic representations of target unseen data used by others [41,70], our model boosts ZSL performance.
In this section, we describe the procedures and methods used in this paper. We first set up the zero-shot learning problem, then develop our novel model BSAE for this task, and finally derive an efficient algorithm to solve it. Subsequently, the classification of unseen classes can be performed in either the original feature space or the semantic space.
We start by introducing the notation and the problem definition. Suppose $m$ labeled source instances $S=\{(x_i^s,y_i^s,s_i^s)\mid x_i^s\in\mathcal{X},\ y_i^s\in C_s\}_{i=1}^{m}$ are given as training data, where $x_i^s\in\mathbb{R}^d$ denotes the $d$-dimensional visual feature, $y_i^s$ is the corresponding class label in $C_s$, which consists of $\tau$ discrete seen classes, and $s_i^s$ is the semantic representation of the $i$-th instance. In addition, $p$ unlabeled target instances $T=\{(x_i^t,y_i^t,s_i^t)\mid x_i^t\in\mathcal{X},\ y_i^t\in C_t\}_{i=1}^{p}$ of unseen classes are given, where $x_i^t\in\mathbb{R}^d$ denotes the $d$-dimensional visual feature and $y_i^t$ is the corresponding label, which belongs to the set $C_t$ of $\mu$ unseen classes. Although the seen and unseen classes are disjoint, i.e., $C_s\cap C_t=\emptyset$, they are associated through the semantic space $\mathcal{A}$, which mitigates this challenge and is spanned by the attribute vector, or the word vector derived from text, of each class. The $k$-dimensional semantic prototypes of seen and unseen classes are denoted as $A_s=[a_1^s,a_2^s,\dots,a_\tau^s]\in\mathbb{R}^{k\times\tau}$ and $A_t=[a_1^t,a_2^t,\dots,a_\mu^t]\in\mathbb{R}^{k\times\mu}$. The matrix $S_{tr}=[s_1^s,s_2^s,\dots,s_m^s]\in\mathbb{R}^{k\times m}$ is given because the source data $X_s=[x_1^s,x_2^s,\dots,x_m^s]\in\mathbb{R}^{d\times m}$ are labeled by either binary or continuous attributes indicating the corresponding class labels $Y_s=\{y_i^s\}_{i=1}^{m}$. In contrast, as the target instances $X_t=[x_1^t,x_2^t,\dots,x_p^t]\in\mathbb{R}^{d\times p}$ are unlabeled, both $S_{te}=[s_1^t,s_2^t,\dots,s_p^t]\in\mathbb{R}^{k\times p}$, which holds their semantic representations, and $Y_t=\{y_i^t\}_{i=1}^{p}$, which denotes their class labels, have to be predicted.
The goal of standard zero-shot learning is to predict the correct class of $X_t$ by learning a classifier $f:T\to C_t$. The key notations used in this paper are listed in Table 1.
| Notation | Description |
|---|---|
| $X_s\in\mathbb{R}^{d\times m}$ | Visual features of source instances |
| $X_t\in\mathbb{R}^{d\times p}$ | Visual features of target instances |
| $Y_s=\{y_i^s\}_{i=1}^{m}$ | Class labels of $X_s$ |
| $Y_t=\{y_i^t\}_{i=1}^{p}$ | Class labels of $X_t$ |
| $C_s=\{1,2,\dots,\tau\}$ | Set of $\tau$ seen classes |
| $C_t=\{1,2,\dots,\mu\}$ | Set of $\mu$ unseen classes |
| $A_t\in\mathbb{R}^{k\times\mu}$ | Semantic prototypes of $C_t$ |
| $S_{tr}\in\mathbb{R}^{k\times m}$ | Semantic representations of $X_s$ |
| $S_{te}\in\mathbb{R}^{k\times p}$ | Semantic representations of $X_t$ |
| $\lambda_1,\lambda_2$ | Hyper-parameters |
| $\mathcal{X}$ | $d$-dimensional visual feature space |
| $\mathcal{A}$ | $k$-dimensional semantic space |
| $m,p$ | Number of source and target instances |
| $d,k$ | Dimensionality of the visual feature space and the semantic space |
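To make the shapes in Table 1 concrete, the following sketch builds toy matrices with the stated dimensions. It assumes class-level attributes, so that every labeled source instance inherits the prototype of its class; the random data, the 0-based labels and the seen-class prototype matrix `A_s` with one column per class are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3        # dimensionality of the visual feature and semantic spaces
tau, mu = 4, 2     # number of seen and unseen classes
m, p = 40, 30      # number of source and target instances

A_s = rng.random((k, tau))            # seen-class prototypes, one column per class
A_t = rng.random((k, mu))             # unseen-class prototypes
y_s = rng.integers(0, tau, size=m)    # source labels (0-based in this sketch)
X_s = rng.normal(size=(d, m))         # source visual features, one column per instance
X_t = rng.normal(size=(d, p))         # unlabeled target visual features

# Assumption: each labeled source instance takes the prototype of its class,
# which gives the k x m matrix of source semantic representations.
S_tr = A_s[:, y_s]
```

Column `i` of `S_tr` is then the attribute vector attached to the `i`-th source instance, while the semantic representations and labels of the target instances remain unknown.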
We begin our discussion with the auto-encoder (AE), since it is the basis of our model. The simplest form of AE is linear and has one hidden layer [48], which is responsible for reconstructing the input data as faithfully as possible. As in [48], we force the hidden layer to be the semantic space so that the latent space is semantically meaningful, e.g., each column of $S_{tr}$ stands for the attribute vector of the corresponding labeled source instance.
Assume that the labeled seen-class training set $S$ and the unlabeled target data $X_t$ are available. The proposed BSAE aims to learn a model that estimates the discriminative semantic representations $S_{te}$ and the reconstructed features $\hat{X}_t$ of the target instances, and then obtains their class labels $Y_t$ in the semantic space and the visual feature space, respectively. Specifically, since the seen and unseen classes are related in the same class embedding space (e.g., attributes), BSAE consists of three components. (1) The encoder, parameterized by $W_0\in\mathbb{R}^{k\times d}$ ($k<d$), projects both domains from the visual feature space $\mathcal{X}$ to the common semantic space $\mathcal{A}$. To ensure that the learned semantic representations capture sufficient discriminative information despite the distribution discrepancy between the domains, two decoders are used. (2) The decoder $W_1\in\mathbb{R}^{d\times k}$ reconstructs the original visual features of the source domain. (3) The decoder $W_2\in\mathbb{R}^{d\times k}$ projects the mapped class embeddings of the target domain back to the visual features. We simultaneously minimize the reconstruction errors in the two domains by utilizing the unlabeled instances from unseen classes to narrow the domain gap. Therefore, the learned regression model generalizes better to unseen classes. As observed from Figure 2, our model preserves enough discriminative information across unseen classes, even in the low-dimensional semantic space that exacerbates the hubness problem.
Our model is learned by optimizing the following objective:
$$\min_{W_0,W_1,W_2} J=\|X_s-W_1W_0X_s\|_F^2+\|X_t-W_2W_0X_t\|_F^2\quad \text{s.t. } W_0X_s=S_{tr}, \tag{3.1}$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Eq 3.1 denotes the loss of the auto-encoder. It is difficult to solve the objective Eq 3.1 with a hard constraint. To handle the constraint $W_0X_s=S_{tr}$ efficiently, we substitute $S_{tr}$ for $W_0X_s$ in the first term and relax the constraint by incorporating a semantic similarity term into Eq 3.1:
$$\min_{W_0,W_1,W_2} J=\|X_s-W_1S_{tr}\|_F^2+\|X_t-W_2W_0X_t\|_F^2+\lambda_1\|W_0X_s-S_{tr}\|_F^2 \tag{3.2}$$
where $\lambda_1$ is a hyper-parameter. The first two terms are regarded as the losses of the two decoders, and the last term is the loss of the encoder. $W_2$ is learned without supervision because the semantic representation $S_{te}$ of the target data is unknown, and [65] shows that adapting $W_2$ from $W_1$ is effective for this issue. To this end, we add the regularization term $\|W_2-W_1\|_F^2$ to Eq 3.2 to restrict the amount of adaptation between the two projections. It is worth noting that $W_1$ serves as a basis, which ensures that $W_2$ cannot deviate freely from $W_1$. The full objective of our proposed model then becomes:
$$\min_{W_0,W_1,W_2} J=\|X_s-W_1S_{tr}\|_F^2+\|X_t-W_2W_0X_t\|_F^2+\lambda_1\|W_0X_s-S_{tr}\|_F^2+\lambda_2\|W_2-W_1\|_F^2, \tag{3.3}$$
where $\lambda_2$ is a hyper-parameter used to balance the importance of different terms.
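As a reading aid, the four terms of Eq 3.3 can be evaluated directly with NumPy. This is a minimal sketch; the function name `bsae_objective` and the toy data are assumptions for illustration, not part of the original implementation.

```python
import numpy as np


def squared_frobenius(M):
    """Squared Frobenius norm of a matrix."""
    return float(np.sum(M ** 2))


def bsae_objective(W0, W1, W2, X_s, X_t, S_tr, lam1, lam2):
    """Evaluate the four terms of Eq 3.3 and return their sum."""
    source_decoder = squared_frobenius(X_s - W1 @ S_tr)
    target_autoencoder = squared_frobenius(X_t - W2 @ W0 @ X_t)
    encoder = lam1 * squared_frobenius(W0 @ X_s - S_tr)
    adaptation = lam2 * squared_frobenius(W2 - W1)
    return source_decoder + target_autoencoder + encoder + adaptation


# Toy data with d = 4, k = 2, m = 6, p = 5.
rng = np.random.default_rng(0)
X_s, X_t = rng.normal(size=(4, 6)), rng.normal(size=(4, 5))
S_tr = rng.normal(size=(2, 6))
W0, W1, W2 = np.zeros((2, 4)), np.zeros((4, 2)), np.zeros((4, 2))
J_zero = bsae_objective(W0, W1, W2, X_s, X_t, S_tr, lam1=0.5, lam2=2.0)
```

With all projections set to zero, the objective reduces to $\|X_s\|_F^2+\|X_t\|_F^2+\lambda_1\|S_{tr}\|_F^2$, which gives a quick sanity check of the implementation.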
Next, we formulate our solver as a novel gradient-based algorithm that alternately updates the projection functions $W_0$, $W_1$ and $W_2$. Note that conventional iterative algorithms (e.g., gradient descent) are widely used to solve such problems directly, but they are not computationally efficient. In contrast, the cost of our solver depends on the dimension of the features rather than the number of instances, and hence it is more efficient than the conventional iterative algorithms. To solve Eq 3.3, we calculate its partial derivatives with respect to $W_0$, $W_1$ and $W_2$ (a common constant factor of 2 is omitted), set each of them to zero:
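$$\frac{\partial J}{\partial W_0}=-W_2^{T}(X_t-W_2W_0X_t)X_t^{T}+\lambda_1(W_0X_s-S_{tr})X_s^{T}, \tag{3.4}$$
$$\frac{\partial J}{\partial W_1}=-(X_s-W_1S_{tr})S_{tr}^{T}-\lambda_2(W_2-W_1), \tag{3.5}$$
$$\frac{\partial J}{\partial W_2}=-(X_t-W_2W_0X_t)X_t^{T}W_0^{T}+\lambda_2(W_2-W_1), \tag{3.6}$$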
and optimize the following sub-problems by alternating optimization.
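The expressions in Eqs 3.4 to 3.6 can be checked numerically against finite differences of Eq 3.3. The sketch below does this on random toy data; the helper names and the step size are assumptions. Since the equations drop the constant factor 2 that arises from differentiating the squared norms, the analytic expressions are doubled before the comparison.

```python
import numpy as np


def objective(W0, W1, W2, X_s, X_t, S_tr, lam1, lam2):
    """Value of Eq 3.3."""
    return float(np.sum((X_s - W1 @ S_tr) ** 2)
                 + np.sum((X_t - W2 @ W0 @ X_t) ** 2)
                 + lam1 * np.sum((W0 @ X_s - S_tr) ** 2)
                 + lam2 * np.sum((W2 - W1) ** 2))


def half_gradients(W0, W1, W2, X_s, X_t, S_tr, lam1, lam2):
    """Right-hand sides of Eqs 3.4-3.6 (the exact gradients are twice these)."""
    R_t = X_t - W2 @ W0 @ X_t
    g0 = -W2.T @ R_t @ X_t.T + lam1 * (W0 @ X_s - S_tr) @ X_s.T
    g1 = -(X_s - W1 @ S_tr) @ S_tr.T - lam2 * (W2 - W1)
    g2 = -R_t @ X_t.T @ W0.T + lam2 * (W2 - W1)
    return g0, g1, g2


def numeric_gradient(f, W, eps=1e-5):
    """Central finite differences of a scalar function f at the matrix W."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        G[idx] = (f(W + step) - f(W - step)) / (2 * eps)
    return G


rng = np.random.default_rng(0)
d, k, m, p = 5, 2, 12, 10
X_s, X_t = rng.normal(size=(d, m)), rng.normal(size=(d, p))
S_tr = rng.normal(size=(k, m))
W0, W1, W2 = rng.normal(size=(k, d)), rng.normal(size=(d, k)), rng.normal(size=(d, k))
data = (X_s, X_t, S_tr, 0.5, 2.0)

g0, g1, g2 = half_gradients(W0, W1, W2, *data)
n0 = numeric_gradient(lambda W: objective(W, W1, W2, *data), W0)
n1 = numeric_gradient(lambda W: objective(W0, W, W2, *data), W1)
n2 = numeric_gradient(lambda W: objective(W0, W1, W, *data), W2)
max_err = max(np.max(np.abs(2 * g0 - n0)),
              np.max(np.abs(2 * g1 - n1)),
              np.max(np.abs(2 * g2 - n2)))
```

Because the objective is quadratic in each projection when the other two are fixed, central differences carry no truncation error here, so `max_err` should be limited to floating-point rounding.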
Setting Eq 3.4 to zero, we obtain:
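$$W_2^{T}W_2W_0+\lambda_1W_0X_sX_s^{T}(X_tX_t^{T})^{-1}=W_2^{T}+\lambda_1S_{tr}X_s^{T}(X_tX_t^{T})^{-1}. \tag{3.7}$$
Let $M_0=W_2^{T}W_2$, $H_0=\lambda_1X_sX_s^{T}(X_tX_t^{T})^{-1}$ and $Q_0=W_2^{T}+\lambda_1S_{tr}X_s^{T}(X_tX_t^{T})^{-1}$; then we have the Sylvester equation:
$$M_0W_0+W_0H_0=Q_0, \tag{3.8}$$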
where $M_0\in\mathbb{R}^{k\times k}$ and $H_0\in\mathbb{R}^{d\times d}$ are square matrices and $Q_0\in\mathbb{R}^{k\times d}$ is a rectangular matrix. The above matrix equation has a unique solution if it satisfies the condition of Theorem 3.1, quoted from [74]. This condition is easily met in practical applications.
Theorem 3.1. Eq 3.8 has a unique solution if and only if $M_0$ and $-H_0$ have no eigenvalue in common, that is, the eigenvalues $\gamma_1,\gamma_2,\dots,\gamma_k$ of $M_0$ and $\zeta_1,\zeta_2,\dots,\zeta_d$ of $H_0$ satisfy $\gamma_i+\zeta_j\neq 0$ ($i=1,\dots,k$; $j=1,\dots,d$).
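The condition of Theorem 3.1 can be tested numerically before solving the equation. The following sketch uses a hypothetical helper, `sylvester_is_uniquely_solvable`, and an arbitrary tolerance. For BSAE the condition is expected to hold whenever $X_sX_s^{T}$ and $X_tX_t^{T}$ are positive definite and $\lambda_1>0$, because $M_0=W_2^{T}W_2$ has nonnegative eigenvalues and $H_0$ is then a product of two positive definite matrices, whose eigenvalues are positive.

```python
import numpy as np


def sylvester_is_uniquely_solvable(M, H, tol=1e-10):
    """Check that gamma_i + zeta_j != 0 for all eigenvalue pairs of M and H."""
    gamma = np.linalg.eigvals(M)
    zeta = np.linalg.eigvals(H)
    sums = gamma[:, None] + zeta[None, :]   # all pairwise sums, shape (k, d)
    return bool(np.min(np.abs(sums)) > tol)


# Eigenvalue sums {4, 5, 5, 6} are all nonzero: unique solution.
ok = sylvester_is_uniquely_solvable(np.diag([1.0, 2.0]), np.diag([3.0, 4.0]))
# The pair (1, -1) sums to zero: the condition fails.
bad = sylvester_is_uniquely_solvable(np.diag([1.0, 2.0]), np.diag([-1.0, 5.0]))
```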
As a result, Eq 3.8 can be easily solved by the Bartels-Stewart algorithm, which is available in MATLAB as the function sylvester and takes a single line of code:
$$W_0=\mathrm{sylvester}(M_0,H_0,Q_0). \tag{3.9}$$
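Outside MATLAB, the same update can be sketched with SciPy, whose `scipy.linalg.solve_sylvester(a, b, q)` solves $AX+XB=Q$. The function name `update_W0` and the toy data are assumptions, and the sketch further assumes that $X_tX_t^{T}$ is invertible, which requires at least $d$ linearly independent target instances.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def update_W0(W2, X_s, X_t, S_tr, lam1):
    """Solve M0 W0 + W0 H0 = Q0 (Eq 3.8) for the encoder W0."""
    G_inv = np.linalg.inv(X_t @ X_t.T)            # (d, d); assumes invertibility
    M0 = W2.T @ W2                                 # (k, k)
    H0 = lam1 * X_s @ X_s.T @ G_inv                # (d, d)
    Q0 = W2.T + lam1 * S_tr @ X_s.T @ G_inv        # (k, d)
    return solve_sylvester(M0, H0, Q0)


rng = np.random.default_rng(0)
d, k, m, p = 6, 3, 30, 25
X_s, X_t = rng.normal(size=(d, m)), rng.normal(size=(d, p))
S_tr = rng.random((k, m))
W2 = np.full((d, k), 0.1)                          # constant initialization
lam1 = 1.0

W0 = update_W0(W2, X_s, X_t, S_tr, lam1)
# Eq 3.4 evaluated at the solution; it should vanish up to rounding.
grad_W0 = (-W2.T @ (X_t - W2 @ W0 @ X_t) @ X_t.T
           + lam1 * (W0 @ X_s - S_tr) @ X_s.T)
```

Checking that `grad_W0` is numerically zero confirms that the Sylvester solution is a stationary point of the $W_0$ sub-problem.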
Setting Eq 3.5 to zero, then we have the following Sylvester equation:
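$$\lambda_2W_1+W_1S_{tr}S_{tr}^{T}=X_sS_{tr}^{T}+\lambda_2W_2. \tag{3.10}$$
Let $M_1=\lambda_2I_d$, where $I_d$ is the $d\times d$ identity matrix (so that the product $M_1W_1$ with $W_1\in\mathbb{R}^{d\times k}$ is well defined), $H_1=S_{tr}S_{tr}^{T}$ and $Q_1=X_sS_{tr}^{T}+\lambda_2W_2$. Then Eq 3.10 can be efficiently solved in MATLAB:
$$W_1=\mathrm{sylvester}(M_1,H_1,Q_1). \tag{3.11}$$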
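Because the left coefficient in Eq 3.10 is a scalar multiple of the identity, the equation can also be rewritten as $W_1(S_{tr}S_{tr}^{T}+\lambda_2I)=X_sS_{tr}^{T}+\lambda_2W_2$ and solved as an ordinary linear system. The sketch below uses this equivalent closed form instead of a Sylvester solver; it is an observation added here for illustration, not the procedure stated in the paper, and `update_W1` is a hypothetical name.

```python
import numpy as np


def update_W1(W2, X_s, S_tr, lam2):
    """Solve Eq 3.10 via W1 (S_tr S_tr^T + lam2 I) = X_s S_tr^T + lam2 W2."""
    k = S_tr.shape[0]
    B = S_tr @ S_tr.T + lam2 * np.eye(k)    # symmetric, positive definite for lam2 > 0
    Q1 = X_s @ S_tr.T + lam2 * W2
    # W1 B = Q1 is equivalent to B W1^T = Q1^T because B is symmetric.
    return np.linalg.solve(B, Q1.T).T


rng = np.random.default_rng(0)
d, k, m = 6, 3, 30
X_s = rng.normal(size=(d, m))
S_tr = rng.random((k, m))
W2 = np.full((d, k), 0.1)
lam2 = 2.0

W1 = update_W1(W2, X_s, S_tr, lam2)
lhs = lam2 * W1 + W1 @ S_tr @ S_tr.T        # left-hand side of Eq 3.10
rhs = X_s @ S_tr.T + lam2 * W2              # right-hand side of Eq 3.10
```

This form only factorizes a $k\times k$ matrix, which may be convenient when $k$ is much smaller than $d$.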
Similarly, we set Eq 3.6 to zero, and have the following formulation:
$$\lambda_2W_2+W_2W_0X_tX_t^{T}W_0^{T}=X_tX_t^{T}W_0^{T}+\lambda_2W_1. \tag{3.12}$$
If we denote $M_2=M_1$, $H_2=W_0X_tX_t^{T}W_0^{T}$ and $Q_2=X_tX_t^{T}W_0^{T}+\lambda_2W_1$, the above Sylvester equation (3.12) can be solved in MATLAB:
$$W_2=\mathrm{sylvester}(M_2,H_2,Q_2). \tag{3.13}$$
Algorithm 1 summarizes the implementation of our algorithm. We simply initialize W2 with all elements of 0.1 for coarse-grained datasets (e.g., AWA1) and all elements of 0.01 for fine-grained datasets (e.g., CUB). The hyper-parameters λ1 and λ2 are selected by cross-validations. Details are listed in section 4. The iterations will terminate when the Eq 3.3 converges or reaches a fixed number of iterations.
Algorithm 1 Bi-shifting Semantic Auto-Encoder
Input: Training data Xs, Str; test data Xt; hyper-parameters λ1, λ2
Output: Projection matrices W0, W2
1: Initialize W2
2: while not converged do
3:   Update W0 by Eq 3.9
4:   Update W1 by Eq 3.11
5:   Update W2 by Eq 3.13
6:   Check the convergence condition
7: end while
8: Return W0, W2
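The steps of Algorithm 1 can be sketched in Python with `scipy.linalg.solve_sylvester` standing in for MATLAB's `sylvester`. This is an illustrative sketch on random toy data, not the authors' implementation: the function names, the relative stopping tolerance `tol`, the iteration cap and the choice to also return $W_1$ and the objective history are assumptions, and $X_tX_t^{T}$ is assumed to be invertible.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def objective(W0, W1, W2, X_s, X_t, S_tr, lam1, lam2):
    """Value of Eq 3.3."""
    return float(np.sum((X_s - W1 @ S_tr) ** 2)
                 + np.sum((X_t - W2 @ W0 @ X_t) ** 2)
                 + lam1 * np.sum((W0 @ X_s - S_tr) ** 2)
                 + lam2 * np.sum((W2 - W1) ** 2))


def fit_bsae(X_s, S_tr, X_t, lam1, lam2, init=0.1, max_iter=50, tol=1e-8):
    d, k = X_s.shape[0], S_tr.shape[0]
    G_t = X_t @ X_t.T                          # (d, d), assumed invertible
    G_t_inv = np.linalg.inv(G_t)
    H0 = lam1 * X_s @ X_s.T @ G_t_inv          # does not change across iterations
    M_id = lam2 * np.eye(d)                    # left coefficient in Eqs 3.10 and 3.12
    W2 = np.full((d, k), init)                 # step 1: constant initialization
    history = []
    for _ in range(max_iter):
        # Step 3: update the encoder W0 (Eq 3.9).
        Q0 = W2.T + lam1 * S_tr @ X_s.T @ G_t_inv
        W0 = solve_sylvester(W2.T @ W2, H0, Q0)
        # Step 4: update the source decoder W1 (Eq 3.11).
        W1 = solve_sylvester(M_id, S_tr @ S_tr.T, X_s @ S_tr.T + lam2 * W2)
        # Step 5: update the target decoder W2 (Eq 3.13).
        W2 = solve_sylvester(M_id, W0 @ G_t @ W0.T, G_t @ W0.T + lam2 * W1)
        # Step 6: stop once the objective no longer decreases noticeably.
        history.append(objective(W0, W1, W2, X_s, X_t, S_tr, lam1, lam2))
        if len(history) > 1 and history[-2] - history[-1] <= tol * max(1.0, history[-2]):
            break
    return W0, W1, W2, history


rng = np.random.default_rng(0)
d, k, m, p = 6, 3, 40, 30
X_s, X_t = rng.normal(size=(d, m)), rng.normal(size=(d, p))
S_tr = rng.random((k, m))
W0, W1, W2, history = fit_bsae(X_s, S_tr, X_t, lam1=1.0, lam2=1.0)
```

The recorded `history` allows the monotone decrease of Eq 3.3 to be inspected after training.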
We now briefly analyze the time complexity and convergence of our algorithm. As shown in Algorithm 1, optimizing the objective function Eq 3.3 amounts to solving three Sylvester equations. Given $M_0$ and $H_0$, the time complexity of solving each Sylvester equation, e.g., Eq 3.8, is $O(k^3 + d^3)$ with $d, k \ll \min(m, p)$, which is independent of the number of instances. In other words, the algorithm can be applied effectively to large-scale datasets. As can be observed from Eq 3.7 to Eq 3.13, the linear formulation splits the problem into three sub-problems with respect to the projection functions $W_0$, $W_1$ and $W_2$, and each update reduces to solving a Sylvester equation. Because each update solves its sub-problem with the other two projections fixed, the objective function Eq 3.3 is non-increasing and bounded below during the alternating optimization.
With Algorithm 1, we obtain the optimal projection functions $W_0$ and $W_2$. We then measure the similarity between the estimated representation of a target instance and each class prototype, and predict the class label accordingly.
In the semantic space, given a target instance $x_{ti}$, we first compute its estimated semantic representation $(S_{te})_i = W_0 x_{ti}$ and then compare it with the prototypes $A_t$ of the classes in $C_t$ using the cosine distance:
$$l(x_{ti}) = \arg\min_{j} d\big((S_{te})_i, (A_t)_j\big), \tag{3.14}$$
where $j \in [\mu]$, $(A_t)_j$ is the prototype attribute vector of the $j$-th unseen class, $d(\cdot,\cdot)$ is a distance function, and $l(\cdot)$ returns the class label of a target instance.
In the feature space, the predicted visual features $\hat{X}_t$ of the unseen classes are easily synthesized by embedding the semantic prototypes of $C_t$ into the visual feature space with $\hat{X}_t = W_2 A_t$. Hence, ZSL is converted to a conventional classification problem, and in principle any supervised classifier can be used. We simply use k-Nearest Neighbor (KNN) to demonstrate the capability of our decoder $W_2$. Similar to the semantic-space procedure above, the class label of a target instance is inferred by computing the cosine distance between the prototype projections and the original visual feature $x_{ti}$:
$$l(x_{ti}) = \arg\min_{j} d\big(x_{ti}, \hat{X}_{tj}\big), \tag{3.15}$$
where $\hat{X}_{tj}$ is the $j$-th unseen class prototype projected into the visual feature space.
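The two decision rules, Eq 3.14 and Eq 3.15, can be sketched with NumPy as follows. This is a minimal illustration that assumes instances and prototypes are stored as matrix columns with non-zero norm; the function names and the toy data are made up for the example.

```python
import numpy as np


def cosine_distance(U, V):
    """Pairwise cosine distance between the columns of U (r, n) and V (r, c)."""
    U_unit = U / np.linalg.norm(U, axis=0, keepdims=True)
    V_unit = V / np.linalg.norm(V, axis=0, keepdims=True)
    return 1.0 - U_unit.T @ V_unit             # shape (n, c)


def predict_in_semantic_space(W0, Xt, At):
    """Eq 3.14: encode the instances with W0 and match them to the prototypes."""
    Ste = W0 @ Xt                               # (k, p) estimated semantics
    return np.argmin(cosine_distance(Ste, At), axis=1)


def predict_in_feature_space(W2, Xt, At):
    """Eq 3.15: decode the prototypes with W2 and match them to the instances."""
    Xt_hat = W2 @ At                            # (d, mu) projected prototypes
    return np.argmin(cosine_distance(Xt, Xt_hat), axis=1)


# Toy data: instances are exact decodings of their class prototypes.
rng = np.random.default_rng(1)
W2 = rng.standard_normal((4, 3))                # decoder, d = 4, k = 3
W0 = np.linalg.pinv(W2)                         # encoder that inverts W2 on its range
At = np.eye(3)                                  # one prototype per class (mu = 3)
labels = np.array([0, 2, 1, 1])
Xt = W2 @ At[:, labels]

pred_semantic = predict_in_semantic_space(W0, Xt, At)
pred_feature = predict_in_feature_space(W2, Xt, At)
```

With a single prototype per class, the KNN rule in the feature space reduces to a nearest-prototype search, which is what `predict_in_feature_space` computes.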
In this section, we first introduce our experimental protocols in detail, and then compare our results with state-of-the-art approaches on five small-scale benchmark datasets (AWA1, AWA2, CUB, SUN, aP & Y) for the conventional zero-shot learning (CZSL) task.
Five benchmark datasets are selected from the datasets widely used for ZSL: AWA1 (Animals with Attributes 1) [27], AWA2 (Animals with Attributes 2) [75], aP & Y (Attribute Pascal and Yahoo) [76], CUB (Caltech-UCSD-Birds 200-2011) [22] and SUN (Scene UNderstanding) [77]. We use two typical protocols to evaluate the performance of our model: standard splits (SS) [27] and proposed splits (PS) [75]. SS is widely used in previous works, but its weakness is that some unseen classes belong to the ImageNet classes seen during training, which violates the true zero-shot rule. In contrast, PS ensures that none of the unseen classes belong to the 1K ImageNet classes used for pre-training the ResNet. PS is much more difficult than SS on account of the low correlation between seen and unseen classes. For clarity, the statistics of these datasets are briefly reported in Table 2.
Table 2. Statistics of the five benchmark datasets.
| Dataset | Granularity | Size | Attributes | $C_s/C_t$ | Images | Training SS ($C_s$) | Training PS ($C_s$) | Testing SS ($C_t$) | Testing PS ($C_t$) |
|---|---|---|---|---|---|---|---|---|---|
| AWA1 | coarse | medium | 85 | 40/10 | 30,475 | 24,295 | 19,832 | 6180 | 10,643 |
| AWA2 | coarse | medium | 85 | 40/10 | 37,322 | 30,337 | 23,527 | 6985 | 13,795 |
| aP & Y | coarse | small | 64 | 20/12 | 15,339 | 12,695 | 5932 | 2644 | 9407 |
| CUB | fine | medium | 312 | 150/50 | 11,788 | 8855 | 7057 | 2933 | 4731 |
| SUN | fine | medium | 102 | 645/72 | 14,340 | 12,900 | 10,320 | 1440 | 4020 |
Like the pioneering works [27,47,48], we use the semantic space spanned by continuous attributes. Each instance is associated with the corresponding continuous class-level attribute vector. The dimension of the semantic space equals that of the attributes, e.g., the semantic space of AWA1 is formed by 85-dim attributes. The attribute dimensions of all datasets are listed in Table 2.
Following the general procedure in the literature, we use visual features extracted by deep convolutional neural networks (CNNs). The GoogleNet features [78] are the 1024-dim activations of the final pooling layer, as in [41]. Furthermore, recent works adopt the pre-trained 2048-dim ResNet features, which are extracted from the 2048-dim top-layer pooling units of the 101-layer ResNet, to achieve improved performance [75]. It is worth noting that the ResNet features are available under two protocols, namely SS and PS, whereas only SS is provided for the GoogleNet features. For a fair comparison, we do not perform any image pre-processing or data augmentation, and we conduct extensive experiments on both types of features.
To demonstrate the capability of BSAE, we evaluate it in the conventional ZSL (CZSL) setting: assuming that the search space is restricted to the unseen classes, the goal is to predict the class labels in $C_t$ at the test stage. We use both the SS and PS protocols in this setting.
Our BSAE model has two hyper-parameters, $\lambda_1$ and $\lambda_2$ (see Eq 3.3), which we select from $\{10^{-2}, 10^{-1}, 1, 10, 10^{2}, 10^{3}, 10^{4}\}$ by cross-validation. Because the two split protocols differ, we tune the parameters in different ways. For the SS protocol, the parameters are chosen by class-wise cross-validation on $C_s$ as in [67]: in each iteration, two seen classes are randomly selected to form a validation set, the best pair $\{\lambda_1, \lambda_2\}$ is chosen on it, and that pair is used for testing on the unseen classes. For the PS protocol, we perform the hyper-parameter search on a disjoint validation set of 13 (AWA1/AWA2), 5 (aP & Y), 50 (CUB) and 65 (SUN) classes, respectively [75]. Note that we report the average performance to ensure the significance of the results.
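The class-wise cross-validation over the grid can be sketched as below. This is only an illustration of the search procedure: `score_fn` is a hypothetical user-supplied callable that trains on the given training classes and returns the accuracy on the held-out classes, and the number of rounds `n_rounds` and the random seed are assumptions, since the text does not state them.

```python
import itertools
import math
import random

GRID = [10.0 ** e for e in range(-2, 5)]   # 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e4


def class_wise_grid_search(seen_classes, score_fn, n_rounds=5, n_val=2, seed=0):
    """Return the (lam1, lam2) pair with the best mean validation score."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_rounds):
        # Hold out n_val randomly chosen seen classes as the validation set.
        val = set(rng.sample(sorted(seen_classes), n_val))
        splits.append((sorted(set(seen_classes) - val), sorted(val)))
    best_pair, best_score = None, float("-inf")
    for lam1, lam2 in itertools.product(GRID, GRID):
        scores = [score_fn(lam1, lam2, train, val) for train, val in splits]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_pair, best_score = (lam1, lam2), mean_score
    return best_pair, best_score


def dummy_score(lam1, lam2, train_classes, val_classes):
    """Stand-in scorer that peaks at lam1 = 10 and lam2 = 0.1."""
    return -((math.log10(lam1) - 1.0) ** 2) - (math.log10(lam2) + 1.0) ** 2


best_pair, best_score = class_wise_grid_search(range(40), dummy_score)
```

In practice `score_fn` would fit the model on the training classes and evaluate the per-class accuracy on the two held-out classes; the dummy scorer only exercises the search logic.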
Most ZSL methods report the top-1 accuracy averaged over all images (e.g., [48]), where a prediction is correct if the predicted class coincides with the ground truth. However, we are interested in high performance on both densely and sparsely populated classes. Therefore, under the CZSL setting, we evaluate our method on the benchmark datasets with the per-class top-1 accuracy proposed in [75]: we compute the top-1 accuracy independently for each class and then average over all unseen classes:
$$acc_{C_t} = \frac{1}{\mu} \sum_{c=1}^{\mu} \frac{\#\,\text{correct predictions in } c}{\#\,\text{instances in } c}. \tag{4.1}$$
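A short NumPy sketch of Eq 4.1 shows how the per-class average differs from the accuracy over all images when classes are unbalanced. The function name and the toy labels are illustrative, and the sketch assumes that every unseen class has at least one test instance.

```python
import numpy as np


def per_class_top1_accuracy(y_true, y_pred):
    """Mean of the class-wise top-1 accuracies (Eq 4.1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))


# Class 0 has four test images, class 1 has two.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])

overall = float(np.mean(y_true == y_pred))            # 5 of 6 images are correct
balanced = per_class_top1_accuracy(y_true, y_pred)    # mean of 1.0 and 0.5
```

Here the accuracy over all images is 5/6, while the per-class accuracy is 0.75, because the error on the sparsely populated class weighs as much as the whole densely populated class.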
We evaluate our proposed framework for zero-shot learning on several benchmark datasets. The competitors are representative, competitive and recently published state-of-the-art methods that cover a wide range of zero-shot learning approaches.
In these experiments, the test instances come only from $C_t$, which is disjoint from the seen classes $C_s$. We use both the SS and PS protocols for more convincing results; the quantitative results are shown in Table 3 and Table 4.
Table 3. Per-class top-1 accuracy (%) under the SS protocol (R: ResNet features, G: GoogleNet features).
| Method | Feature | AWA1 | AWA2 | SUN | CUB | aP & Y | Average |
|---|---|---|---|---|---|---|---|
| DAP [27] | R | 57.1 | 58.7 | 38.9 | 37.5 | 35.2 | 45.5 |
| IAP [27] | R | 48.1 | 46.9 | 17.4 | 27.1 | 22.4 | 32.4 |
| CONSE [37] | R | 63.6 | 67.9 | 44.2 | 36.7 | 25.9 | 47.7 |
| DEVISE [38] | R | 72.9 | 68.6 | 57.5 | 53.2 | 35.4 | 57.5 |
| CMT [39] | R | 58.9 | 66.3 | 41.9 | 37.3 | 26.9 | 46.3 |
| SSE [67] | R | 68.8 | 67.5 | 54.5 | 43.7 | 31.1 | 53.1 |
| SJE [41] | R | 76.7 | 69.5 | 57.1 | 55.3 | 32 | 58.1 |
| ESZSL [47] | R | 74.7 | 75.6 | 57.3 | 55.1 | 34.4 | 59.4 |
| LATEM [42] | R | 74.8 | 68.7 | 56.9 | 49.4 | 34.5 | 56.9 |
| ALE [58] | R | 78.6 | 80.3 | 59.1 | 53.2 | 30.9 | 60.4 |
| SYNC [49] | R | 72.2 | 71.2 | 59.1 | 54.1 | 39.7 | 59.3 |
| SYNC [49] | G | 72.9 | - | 62.7 | 54.7 | - | - |
| SAE [48] | R | 80.6 | 80.7 | 42.4 | 33.4 | 8.3 | 49.1 |
| SAE [48] | G | 81.9 | - | 59.7 | 53.6 | 34.5 | - |
| SSZSL [79] | V | 88.6 | - | - | 58.8 | 49.9 | - |
| DSRL [57] | V | 87.2 | - | - | 57.1 | 56.3 | - |
| STZSL [80] | V | 83.7 | - | - | 58.7 | 54.4 | - |
| GFZSL [62] | R | 80.5 | 79.3 | 62.9 | 53 | 51.3 | 65.4 |
| TSTD [59] | V | 90.3 | - | - | 58.2 | - | - |
| QFSL [61] | R | - | 84.8 | 61.7 | 69.7 | - | - |
| VCL [56] | R | 82.0 | 82.5 | 63.8 | 60.1 | - | - |
| DEARF [60] | R | 81 | 81.2 | 64.3 | 56.1 | - | - |
| BSAE | R | 90.7 | 85.2 | 60.4 | 61.8 | 63.2 | 72.3 |
| BSAE | G | 91 | - | 67.1 | 62.4 | 57.7 | - |
Table 4. Per-class top-1 accuracy (%) under the PS protocol.
| Method | AWA1 | AWA2 | CUB | SUN | aP & Y | Average |
|---|---|---|---|---|---|---|
| DAP [27] | 44.1 | 46.1 | 40 | 39.9 | 33.8 | 40.8 |
| IAP [27] | 35.9 | 35.9 | 24 | 19.4 | 36.6 | 30.4 |
| CONSE [37] | 45.6 | 44.5 | 34.3 | 38.8 | 26.9 | 38 |
| DEVISE [38] | 54.2 | 59.7 | 52 | 56.5 | 39.8 | 52.4 |
| CMT [39] | 39.5 | 37.9 | 34.6 | 33.9 | 28 | 34.8 |
| SSE [67] | 60.1 | 61 | 43.9 | 51.5 | 34 | 50.1 |
| SJE [41] | 65.6 | 61.9 | 53.9 | 53.7 | 32.9 | 53.6 |
| ESZSL [47] | 58.2 | 58.6 | 53.9 | 54.5 | 38.3 | 52.7 |
| LATEM [42] | 55.1 | 55.8 | 49.3 | 55.3 | 35.2 | 50.1 |
| ALE [58] | 59.9 | 62.5 | 54.9 | 58.1 | 39.7 | 55 |
| SYNC [49] | 54 | 46.6 | 55.6 | 56.3 | 23.9 | 47.3 |
| SAE [48] | 53 | 54.1 | 33.3 | 40.3 | 8.3 | 37.8 |
| DEM [70] | 68.4 | 67.1 | 51.7 | 61.9 | 35 | 56.8 |
| GFZSL [62] | 68.3 | 63.8 | 49.3 | 60.6 | 38.4 | 56.1 |
| CAVE [55] | 71.4 | 65.8 | 52.1 | 61.7 | - | - |
| PSR [81] | - | 63.8 | 56 | 61.4 | 38.4 | - |
| TVN [82] | - | 68.8 | 58.1 | 60.7 | - | - |
| GAZSL [54] | - | 68.4 | 55.8 | 61.3 | 41.1 | - |
| f-CLSWGAN [50] | - | 68.8 | 57.3 | 60.8 | 40.5 | - |
| GDAN [51] | - | 67.7 | 51 | 54.8 | 40.4 | - |
| LESAE [66] | 66.1 | 68.4 | 53.9 | 60 | 40.8 | 57.8 |
| LisGAN [68] | 70.6 | - | 58.8 | 61.7 | 43.1 | - |
| SRGAN [53] | 71.9 | - | 55.4 | 62.2 | - | - |
| DEARF [60] | 72.1 | 69.3 | 38.5 | 48.6 | - | - |
| BSR [52] | - | 68.4 | 57.7 | 61.2 | 41.3 | - |
| BSAE | 72.3 | 69.4 | 59.7 | 58.6 | 53 | 62.6 |
For the SS protocol, SAE [48] is close to our model but lacks per-class top-1 accuracy results with GoogleNet features. For a fair comparison, we reproduce SAE by following the settings of the original paper and use the same classifier to predict the class labels. Using the code available online, we re-implement GFZSL [62] and SYNC [49] to obtain their recognition results. Note that the first 10 methods in Table 3 are cited from [75] and the rest are copied from the original papers. To further verify that our method is not effective only for specific visual features, we run our model under the SS protocol with both 1024-dim GoogleNet features (G) and 2048-dim ResNet features (R).
From the comprehensive comparison in Table 3, we observe the following. (1) Our model achieves the best result on four of the five datasets, i.e., AWA1, AWA2, SUN and aP & Y. Specifically, the improvements over the strongest competitor are 0.7%, 0.4% and 6.9% on AWA1, AWA2 and aP & Y, respectively. On the fine-grained dataset SUN, which contains more classes and relatively fewer instances per class, our result of 67.1% is 2.8% higher than that of the strongest competitor [60]. The accuracy boost can be attributed to the combination of semantic representations and domain adaptation constraints, which significantly improves the classification ability. (2) From Figure 3, we can observe that the overlap between the unseen classes of CUB, a well-known challenging dataset, is particularly striking. Moreover, the visual-semantic projection is hard to learn because the training instances are sparse (∼60 instances per class). Nevertheless, our model still performs well on this dataset, which indicates that it learns a more effective and stable visual-semantic relation from seen data for the analysis of unseen data. (3) [73] and [75] demonstrate that VggNet and ResNet features lead to better ZSL results than GoogleNet features. Even so, our model with GoogleNet features consistently performs favorably against the state of the art on the benchmarks, with the best results of 91% on AWA1 and 67.1% on SUN. This provides further evidence that our model achieves good performance on both coarse-grained and fine-grained datasets even when the features are not the strongest.
For the PS protocol, we keep the same setting as in [75] to make sure that the unseen classes at test time do not overlap with the 1K training classes of ImageNet. The first 12 reported results are cited from [75] and the others are copied from their original papers. In general, the performance is expected to degrade under this stricter setting. From the comparative results listed in Table 4, we observe that the average per-class top-1 accuracy of our model is 4.8% higher than those of all the other methods, and that its accuracy drops the least from coarse-grained to fine-grained datasets, which makes the improvement more significant. Because auto-encoders and GAN-based models share a similar idea, we also compare with several representative GAN-based methods, e.g., LisGAN [68] and SRGAN [53]. Our model outperforms these competitors on four of the five datasets, while it is 3.6% below SRGAN [53] on the challenging split of SUN. We conjecture that our regression model over-fits because of the scarce training instances of each class.
It is worth noting that no single approach achieves the best results on all datasets simultaneously [60]. The aforementioned improvements establish new baselines in the area of ZSL, given that most of the compared models use more complicated nonlinear formulations, and some of them combine complementary semantic spaces or even generate richer features for unseen classes. In contrast, we use only one type of semantic space and computationally fast linear projection functions, yet gain a significant performance boost.
We evaluate the effectiveness of the decoder of BSAE via the image retrieval task, which is defined as searching for the top matched images by taking the provided semantic prototypes of unseen classes as queries. The measurement is precision, i.e., the ratio of the number of correctly retrieved images to the number of all retrieved images. Table 5 reports 5 out of 50 classes in CUB and 5 out of 65 classes in SUN and depicts the qualitative results of our model with the highest anterior and posterior scores for each unseen class. Specifically, each column is a category, with the class name and precision shown at the top. The first three rows in the middle are the top-3 correctly retrieved instances, and the following three rows are the top-3 misclassified instances of each unseen class. As the top correct images show, BSAE reasonably captures discriminative visual information using only the semantic prototype. This suggests that the adaptation regularization helps to make approximate inference about unseen instances. Meanwhile, taking the class in the second column as an example, Pomarine Jaeger and Rhinoceros Auklet are visually similar to Pacific Loon, and the discriminative ability of the decoder is not sufficient to distinguish their visual appearances. Because of the strong visual similarity and the few differing attributes among these classes, we further notice that they are hard to recognize without expert knowledge, even for humans.
Grasshopper Sparrow 50.5% | Pacific Loon 88.2% | Rhinoceros Auklet 86.8% | Western Grebe 85.7% | Least Auklet 84.8% | market outdoor 82.6% | recycling plant outdoor 37.8% | van interior 44.7% | subway station platform 54.8% | lecture room 54.5% |
[Retrieved images for each class are not reproduced here.]
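The retrieval measurement can be sketched as follows. The sketch assumes that images are ranked by cosine similarity to the decoded prototype $W_2 (A_t)_j$ and that the `top_n` best matches count as retrieved; both the ranking rule and the `top_n` parameter are assumptions made for illustration, since the text does not specify them.

```python
import numpy as np


def retrieval_precision(W2, At, Xt, yt, class_idx, top_n):
    """Precision of the top_n images retrieved for one unseen-class prototype."""
    query = W2 @ At[:, class_idx]                  # prototype in the visual space
    sims = (Xt.T @ query) / (np.linalg.norm(Xt, axis=0) * np.linalg.norm(query))
    retrieved = np.argsort(-sims)[:top_n]          # most similar images first
    return float(np.mean(yt[retrieved] == class_idx))


# Toy example with d = k = 2 and an identity decoder.
W2 = np.eye(2)
At = np.eye(2)                                     # two class prototypes
Xt = np.array([[1.0, 1.0, 0.1, 0.9],
               [0.1, -0.1, 1.0, 0.5]])             # four images as columns
yt = np.array([0, 0, 1, 1])                        # ground-truth classes

precision_top2 = retrieval_precision(W2, At, Xt, yt, class_idx=0, top_n=2)
precision_top3 = retrieval_precision(W2, At, Xt, yt, class_idx=0, top_n=3)
```

In the toy example the two images of class 0 are ranked first, and the third retrieved image belongs to class 1, so the precision falls from 1 to 2/3 when `top_n` grows from 2 to 3.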
To illustrate BSAE in ZSL in a straightforward way, we use t-SNE visualization [83] to compare the visual features with the genuine class labels (left) and the semantic representations with the predicted class labels (right) in Figure 4. Each color represents one class, and all the features are embedded into two dimensions using t-SNE. The visualization suggests that our model captures the underlying global distribution in the semantic space and performs better on the dataset. It is worth noting that our model alleviates the hubness problem in the lower-dimensional semantic space. Moreover, the instances of the same class are grouped into one cluster in Figure 4, which confirms that the discriminative semantic representations learned by our model are able to cluster visually similar instances. Therefore, our proposed model preserves the local information of the target unseen classes, in that closeness is kept in the projected semantic representations.
The ROC curve and the AUC value depict the trade-off between the false positive rate (1 − specificity) and the true positive rate (sensitivity), and serve as a metric of the performance of our proposed BSAE. Figure 5 shows the ROC curves and AUC values on AWA1 under the PS protocol. We can observe that the ROC curves of the 10 unseen classes are close to the top-left corner of the plot, even though only the simplest KNN classifier is used.
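For one class against the rest, the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from this definition. How the per-class scores are obtained is not stated in the text; using, for example, the negative cosine distance to the class prototype is an assumption.

```python
import numpy as np


def one_vs_rest_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))


labels = [1, 1, 0, 0]
auc_perfect = one_vs_rest_auc([0.9, 0.8, 0.3, 0.1], labels)   # positives always on top
auc_chance = one_vs_rest_auc([0.9, 0.1, 0.8, 0.3], labels)    # half of the pairs correct
```

A curve that hugs the top-left corner corresponds to an AUC close to 1, as in the first call, whereas an AUC of 0.5, as in the second call, corresponds to chance-level ranking.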
We observe that almost all methods perform worst on aP & Y compared with the other datasets. To show our experimental results in a more fine-grained manner, we take the PS protocol of the aP & Y dataset as an example and compare with the best competitor, LisGAN [68]. Figure 6 shows the confusion matrices of LisGAN and our model BSAE on the 12 unseen classes. Each diagonal value of a confusion matrix indicates the ratio of correctly predicted instances of the corresponding class, and a darker color represents a higher class-wise accuracy. It can be seen that BSAE generally performs better on most classes. Concretely, BSAE improves over LisGAN by 38%, 10%, 24%, 49%, 41%, 3% and 3% on "horse", "motorbike", "person", "sheep", "goat", "jetski" and "statue", respectively. Although the GAN model handles the zero-shot problem directly by converting it to a supervised task, we find that our model performs better than the GAN-based model LisGAN. In addition, it is common that a single model does not achieve the highest accuracy on every unseen class, which leaves room for improvement in future work.
We measure the inter-class and intra-class distances to investigate whether BSAE can alleviate the domain shift and hubness problems. We follow the two measurements provided by [84]:
$$D_{\mathrm{intra}}^{c} = \frac{1}{n_c} \sum_{i} D\big(\varphi(A_c), \psi(s_{ci})\big), \tag{4.2}$$
$$D_{\mathrm{inter}}^{c} = \frac{1}{C-1} \sum_{j \neq c} D\big(\varphi(A_c), \varphi(A_j)\big), \tag{4.3}$$
where $n_c$ is the number of instances in the $c$-th class, $C$ is the number of classes, $\varphi(\cdot)$ and $\psi(\cdot)$ denote the two-dimensional outputs of t-SNE [83], and $D(\cdot)$ is the cosine distance, which reflects the degree of similarity between the two compared vectors [81]. $D_{\mathrm{intra}}^{c}$ is the mean distance between the $c$-th class prototype $A_c$ and the semantic representations of the instances in that class. $D_{\mathrm{inter}}^{c}$ is the mean distance between the $c$-th class prototype $A_c$ and the prototypes of all other classes. We compare our proposed model with TSTD [59], which is the best competitor under the SS protocol on the AWA1 dataset. For fairness, we re-implement the experiments of the two methods under the same settings on AWA1, i.e., we use ResNet-101 features and take continuous class-level attributes as semantic representations. Although TSTD [59] uses the attributes of unseen classes during training to improve its performance, BSAE obtains smaller intra-class distances on nine of the ten classes and larger inter-class distances on all ten classes, in most cases by a large margin, as illustrated in Table 6. Thus, BSAE is capable of alleviating the domain shift problem as well as the hubness problem in the lower-dimensional semantic space.
Table 6. Intra-class and inter-class distances of BSAE and TSTD on AWA1.
| Class Name | $D_{\mathrm{intra}}^{c}$ (BSAE) | $D_{\mathrm{intra}}^{c}$ (TSTD) | $D_{\mathrm{inter}}^{c}$ (BSAE) | $D_{\mathrm{inter}}^{c}$ (TSTD) |
|---|---|---|---|---|
| chimpanzee | 0.275 | 0.444 | 1.663 | 1.361 |
| giant+panda | 0.441 | 0.442 | 1.583 | 1.264 |
| leopard | 0.36 | 0.39 | 1.69 | 1.246 |
| persian+cat | 0.077 | 0.388 | 1.887 | 1.501 |
| pig | 0.199 | 0.43 | 1.868 | 1.399 |
| hippopotamus | 0.412 | 0.498 | 1.594 | 1.337 |
| humpback+whale | 0.318 | 0.35 | 1.28 | 1.257 |
| raccoon | 0.224 | 0.333 | 1.326 | 1.192 |
| rat | 0.215 | 0.28 | 1.28 | 1.234 |
| seal | 0.257 | 0.228 | 1.31 | 1.243 |
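The two measurements of Eq 4.2 and Eq 4.3 can be sketched as below. The sketch takes the two-dimensional embeddings as given inputs and does not run t-SNE itself; the function names, the array layout and the toy values are assumptions made for illustration.

```python
import numpy as np


def cosine_dist(u, v):
    """Cosine distance between two vectors (ranges from 0 to 2)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def intra_inter_distances(proto_2d, inst_2d, labels):
    """Eqs 4.2 and 4.3 on 2-D embeddings.

    proto_2d: (C, 2) embedded class prototypes.
    inst_2d:  (n, 2) embedded semantic representations of the instances.
    labels:   (n,) class index of each instance.
    """
    C = proto_2d.shape[0]
    d_intra = np.zeros(C)
    d_inter = np.zeros(C)
    for c in range(C):
        members = inst_2d[labels == c]
        d_intra[c] = np.mean([cosine_dist(proto_2d[c], s) for s in members])
        d_inter[c] = np.mean([cosine_dist(proto_2d[c], proto_2d[j])
                              for j in range(C) if j != c])
    return d_intra, d_inter


# Toy embeddings: class 0 is tight, class 1 has one instance on the wrong axis.
proto_2d = np.array([[1.0, 0.0], [0.0, 1.0]])
inst_2d = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
d_intra, d_inter = intra_inter_distances(proto_2d, inst_2d, labels)
```

A small intra-class distance together with a large inter-class distance indicates that instances stay close to their own prototype while the prototypes remain well separated.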
In this section, we analyze the complexity and convergence of BSAE. Notably, the operations of Algorithm 1 consist mostly of matrix multiplications, which greatly accelerates the training process. Additionally, we set the maximum number of iterations to 200. The F-norm of the parameter variation with respect to the iteration on fine-grained datasets is reported.
From Figure 7a, our model reaches 80% of the accuracy within 4 iterations and is close to the highest accuracy after around 10 iterations on the coarse-grained dataset AWA1 and after around 20 iterations on the fine-grained datasets. This demonstrates that our algorithm is well suited to practical applications owing to its low complexity and good performance.
Figures 7b and 7c show that the algorithm converges within 40 iterations. The decoder $W_2$ of the target domain is clearly well restricted by the decoder $W_1$ of the source domain, which verifies the significance of the adaptation regularization. Moreover, these observations support the theoretical analysis of complexity and convergence in Section 3.3.
To provide further insight into the role of the two regularization terms, $\|W_0 X_s - S_{tr}\|_F^2$ and $\|W_2 - W_1\|_F^2$, in our objective function, we compare the full BSAE model with stripped-down versions on the PS protocol of CZSL. Specifically, for $\lambda_1 = 0$, the similarity constraint between the predicted and actual semantic representations is not imposed on the encoder, i.e., Eq 3.3 is used without the $\|W_0 X_s - S_{tr}\|_F^2$ term (denoted BSAE-SR), and BSAE degrades to a model that contains only the adaptation regularization. In this case the encoder does not ensure that the learned semantic representation of each instance is close to its class prototype. For $\lambda_2 = 0$, i.e., BSAE without the adaptation regularization $\|W_2 - W_1\|_F^2$ (denoted BSAE-DR), the decoder in the target domain is no longer constrained by the decoder in the source domain, which is supervised by the semantic prototypes of the source instances. Figure 8 clearly shows that both terms contribute to the superior performance of the proposed model: the full model achieves improvements of up to around 10% on the five datasets. It is reasonable to believe that learning the semantics helps the learning of domain adaptation between seen and unseen classes.
In this paper, we have proposed a novel model called Bi-shifting Auto-Encoder to perform efficient zero-shot recognition in the semantic and visual spaces by taking advantage of an autoencoder network. Our model learns generalizable and computationally fast projection functions in a transductive setting, which leverages the labeled source data and the visual features of the unlabeled target data. In particular, to improve the discriminability of the semantic embeddings, the encoder is constrained by aligning the semantic representations of the labeled source instances with the corresponding prototypes of the seen classes. Furthermore, to guarantee the generalizability of the projected semantic representations, two different decoders reconstruct the visual features of the instances in the source and target domains simultaneously under the adaptation regularization. Thus, our model recovers the interactions between visual features and semantics, and is able to alleviate the projection shift problem. Extensive experiments on five benchmark datasets and comparative evaluations demonstrate that our model yields superior performance on zero-shot learning. The major limitation of our model is that each class is represented by a single attribute prototype in the semantic space, which is insufficient to fully characterize the class, so the semantics of some instances may be misplaced from the class prototype. Therefore, our future work will explore different types of semantic representations to investigate the relationships between classes, especially the subtle differences among the classes of fine-grained datasets. An additional limitation of this study is that the full set of unlabeled target instances is used, ignoring their distinct effects on model learning. A natural extension of this work is to identify the most useful unseen instances for zero-shot classification.
This work was supported by the Fundamental Research Funds for the Central Universities.
The authors declare that there are no conflicts of interest.
[1] | M. C. Mackey, L. Glass, Oscillation theory of differential equations with deviating arguments, Dekker, New York, 1987. |
[2] | Y. Kuang, Delay differential equations: With applications in population dynamics, Boston: Academic Press, 1993. |
[3] | B. S. Lalli, B. G. Zhang, On a periodic delay population model, Quart. Appl. Math., 52 (1994), 35–42. https://doi.org/10.1090/qam/1262316 |
[4] | S. H. Saker, S. Agarwal, Oscillation and global attractivity in a periodic Nicholson's blowflies model, Math. Comput. Model., 35 (2002), 719–731. http://dx.doi.org/10.1016/S0895-7177(02)00043-2 |
[5] | S. N. Chow, Remarks on one dimensional delay-differential equations, J. Math. Anal. Appl., 41 (1973), 426–429. http://dx.doi.org/10.1016/0022-247X(73)90217-5 |
[6] | H. I. Freedman, J. Wu, Periodic solutions of single-species models with periodic delay, SIAM J. Math. Anal., 23 (1992), 689–701. http://dx.doi.org/10.1137/0523035 |
[7] | Y. Kuang, H. L. Smith, Periodic solutions of differential delay equations with threshold-type delays, oscillations and dynamics in delay equations, Contemp. Math., 129 (1992), 153–176. http://dx.doi.org/10.1090/conm/129/1174140 |
[8] | Y. H. Fan, W. T. Li, L. L. Wang, Periodic solutions of delayed ratio-dependent predator-prey models with monotonic or nonmonotonic functional response, Nonlinear Anal., 129 (1992), 153–176. https://doi.org/10.1016/S1468-1218(03)00036-1 |
[9] | D. Q. Jiang, J. J. Wei, B. Zhang, Positive periodic solutions of functional differential equations and population models, Electron. J. Differ. Eq., 71 (2002), 1–13. |
[10] | L. L. Wang, W. T. Li, Periodic solutions and permanence for a delayed nonautonomous ratio-dependent predator-prey model with Holling type functional response, J. Comput. Appl. Math., 162 (2004), 341–357. http://dx.doi.org/10.1016/j.cam.2003.06.005 |
[11] | Y. S. Liu, Periodic boundary value problems for first order functional differential equations with impulse, J. Comput. Appl. Math., 223 (2009), 27–39. http://dx.doi.org/10.1016/j.cam.2007.12.015 |
[12] | J. L. Li, J. H. Shen, New comparison results for impulsive functional differential equations, Appl. Math. Lett., 23 (2010), 487–493. http://dx.doi.org/10.1016/j.aml.2009.12.010 |
[13] | J. R. Graef, L. J. Kong, Existence of multiple periodic solutions for first order functional differential equations, Math. Comput. Mod., 54 (2011), 2962–2968. http://dx.doi.org/10.1016/j.mcm.2011.07.018 |
[14] | X. M. Zhang, M. Q. Feng, Multi-parameter, impulsive effects and positive periodic solutions of first-order functional differential equations, Bound. Value Probl., 2015 (2015), 137. http://dx.doi.org/10.1186/s13661-015-0401-x |
[15] | S. S. Cheng, G. Zhang, Existence of positive periodic solutions for non-autonomous functional differential equations, Electron. J. Differ. Eq., 59 (2001), 1–8. http://dx.doi.org/10.1111/1468-0262.00185 |
[16] | H. Y. Wang, Positive periodic solutions of functional differential equations, J. Differential Equations, 202 (2004), 354–366. http://dx.doi.org/10.1016/j.jde.2004.02.018 |
[17] | W. X. Liu, W. T. Li, Existence and uniqueness of positive periodic solutions of functional differential equations, J. Math. Anal. Appl., 293 (2004), 28–39. http://dx.doi.org/10.1016/j.jmaa.2003.12.012 |
[18] | D. Hilbert, Ueber die gerade Linie als kürzeste Verbindung zweier Punkte, Math. Ann., 46 (1970), 91–96. http://dx.doi.org/10.1007/BF02096204 |
[19] | G. Birkhoff, Extensions of Jentzsch's theorem, Trans. Amer. Math. Soc., 85 (1957), 219–227. http://dx.doi.org/10.2307/1992971 |
[20] | F. Klein, Ueber die sogenannte nicht-euklidische geometrie, J. Math. Ann., 4 (1871), 573–625. http://dx.doi.org/10.1007/BF02100583 |
[21] | P. J. Bushell, The Cayley-Hilbert metric and positive operators, Linear Algebra Appl., 84 (1986), 271–280. https://doi.org/10.1016/0024-3795(86)90319-8 |
[22] | M. J. Huang, C. Y. Huang, T. M. Tsai, Applications of Hilbert's projective metric to a class of positive nonlinear operators, Linear Algebra Appl., 413 (2006), 202–211. http://dx.doi.org/10.1016/j.laa.2005.08.024 |
[23] | K. Koufany, Application of Hilbert's projective metric on symmetric cones, Acta Math. Sin., 22 (2006), 1467–1472. http://dx.doi.org/10.1007/s10114-005-0755-6 |
[24] | D. J. Guo, V. Lakshmikantham, Nonlinear problems in abstract cones, New York: Academic Press, 1988. https://doi.org/10.1016/C2013-0-10750-7 |
[25] | P. J. Bushell, Hilbert's metric and positive contraction mappings in a Banach space, Arch. Ration. Mech. Anal., 52 (1973), 330–338. http://dx.doi.org/10.1007/BF00247467 |
[26] | P. J. Bushell, On a class of Volterra and Fredholm nonlinear integral equations, Math. Proc. Camb. Phil. Soc., 79 (1976), 329–335. http://dx.doi.org/10.1017/s0305004100052324 |
[27] | H. Amann, Fixed point equations and nonlinear eigenvalue problems in ordered Banach spaces, SIAM Review, 18 (1976), 620–709. http://dx.doi.org/10.1137/1018114 |
[28] | A. J. B. Potter, Existence theorem for a non-linear integral equation, J. London Math. Soc., 1 (1975), 7–10. http://dx.doi.org/10.1112/jlms/s2-11.1.7 |
[29] | A. Meir, E. B. Keller, A theorem on contraction mappings, J. Math. Anal. Appl., 28 (1969), 326–329. http://dx.doi.org/10.1016/0022-247X(69)90031-6 |
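| Notation | Description |
|---|---|
| $X_s \in \mathbb{R}^{d \times m}$ | Visual features of source instances |
| $X_t \in \mathbb{R}^{d \times p}$ | Visual features of target instances |
| $Y_s = \{y_{si}\}_{i=1}^{m}$ | Class labels of $X_s$ |
| $Y_t = \{y_{ti}\}_{i=1}^{p}$ | Class labels of $X_t$ |
| $C_s = \{1, 2, \dots, \tau\}$ | Set of $\tau$ seen classes |
| $C_t = \{1, 2, \dots, \mu\}$ | Set of $\mu$ unseen classes |
| $A_t \in \mathbb{R}^{k \times \mu}$ | Semantic prototypes of $C_t$ |
| $S_{tr} \in \mathbb{R}^{k \times m}$ | Semantic representations of $X_s$ |
| $S_{te} \in \mathbb{R}^{k \times p}$ | Semantic representations of $X_t$ |
| $\lambda_1, \lambda_2$ | Hyper-parameters |
| $\mathcal{X}$ | $d$-dimensional visual feature space |
| $\mathcal{A}$ | $k$-dimensional semantic space |
| $m, p$ | Number of source and target instances |
| $d, k$ | Dimensionality of the visual feature and semantic spaces |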
Algorithm 1 Bi-shifting Semantic Auto-Encoder |
Input: Training data Xs, Str Test data Xt Hyper-parameters λ1, λ2 Output: Projection matrices W0, W2 1: Initialize W2 2: while not converge do 3: Update W0 by Eq 3.9 4: Update W1 by Eq 3.11 5: Update W2 by Eq 3.13 6: Check the converge condition 7: end while 8: Return W0, W2 |
At Training Time | At Testing Time | |||||||||
Datasets | Granularity | Size | Attributes | Cs/Ct | Images | SS (Cs) | PS (Cs) | SS (Ct) | PS (Ct) | |
AWA1 | coarse | medium | 85 | 40/10 | 30,475 | 24,295 | 19,832 | 6180 | 10,643 | |
AWA2 | coarse | medium | 85 | 40/10 | 37,322 | 30,337 | 23,527 | 6985 | 13,795 | |
aP & Y | coarse | small | 64 | 20/12 | 15,339 | 12,695 | 5932 | 2644 | 9407 | |
CUB | fine | medium | 312 | 150/50 | 11,788 | 8855 | 7057 | 2933 | 4731 | |
SUN | fine | medium | 102 | 645/72 | 14,340 | 12,900 | 10,320 | 1440 | 4020 |
| Method | Feature | AWA1 | AWA2 | SUN | CUB | aP & Y | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DAP [27] | R | 57.1 | 58.7 | 38.9 | 37.5 | 35.2 | 45.5 |
| IAP [27] | R | 48.1 | 46.9 | 17.4 | 27.1 | 22.4 | 32.4 |
| CONSE [37] | R | 63.6 | 67.9 | 44.2 | 36.7 | 25.9 | 47.7 |
| DEVISE [38] | R | 72.9 | 68.6 | 57.5 | 53.2 | 35.4 | 57.5 |
| CMT [39] | R | 58.9 | 66.3 | 41.9 | 37.3 | 26.9 | 46.3 |
| SSE [67] | R | 68.8 | 67.5 | 54.5 | 43.7 | 31.1 | 53.1 |
| SJE [41] | R | 76.7 | 69.5 | 57.1 | 55.3 | 32.0 | 58.1 |
| ESZSL [47] | R | 74.7 | 75.6 | 57.3 | 55.1 | 34.4 | 59.4 |
| LATEM [42] | R | 74.8 | 68.7 | 56.9 | 49.4 | 34.5 | 56.9 |
| ALE [58] | R | 78.6 | 80.3 | 59.1 | 53.2 | 30.9 | 60.4 |
| SYNC [49] | R | 72.2 | 71.2 | 59.1 | 54.1 | 39.7 | 59.3 |
| SYNC [49] | G | 72.9 | - | 62.7 | 54.7 | - | - |
| SAE [48] | R | 80.6 | 80.7 | 42.4 | 33.4 | 8.3 | 49.1 |
| SAE [48] | G | 81.9 | - | 59.7 | 53.6 | 34.5 | - |
| SSZSL [79] | V | 88.6 | - | - | 58.8 | 49.9 | - |
| DSRL [57] | V | 87.2 | - | - | 57.1 | 56.3 | - |
| STZSL [80] | V | 83.7 | - | - | 58.7 | 54.4 | - |
| GFZSL [62] | R | 80.5 | 79.3 | 62.9 | 53.0 | 51.3 | 65.4 |
| TSTD [59] | V | 90.3 | - | - | 58.2 | - | - |
| QFSL [61] | R | - | 84.8 | 61.7 | 69.7 | - | - |
| VCL [56] | R | 82.0 | 82.5 | 63.8 | 60.1 | - | - |
| DEARF [60] | R | 81.0 | 81.2 | 64.3 | 56.1 | - | - |
| BSAE | R | 90.7 | 85.2 | 60.4 | 61.8 | 63.2 | 72.3 |
| BSAE | G | 91.0 | - | 67.1 | 62.4 | 57.7 | - |
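Judging from the complete rows, the Average column appears to be the plain mean of the five per-dataset scores, rounded to one decimal, and it is left as "-" whenever a score is missing. A minimal sketch of that reading follows. The function name `row_average` and the use of `None` for "-" are assumptions made for illustration, and the scores are copied from the table.

```python
from typing import Optional, Sequence


def row_average(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of a row rounded to one decimal, or None if any score is missing."""
    if any(score is None for score in scores):
        return None
    return round(sum(scores) / len(scores), 1)


# Scores in table order (AWA1, AWA2, SUN, CUB, aP & Y); None stands for "-".
rows = {
    "DAP [27] (R)": [57.1, 58.7, 38.9, 37.5, 35.2],
    "BSAE (R)": [90.7, 85.2, 60.4, 61.8, 63.2],
    "BSAE (G)": [91.0, None, 67.1, 62.4, 57.7],
}

averages = {method: row_average(scores) for method, scores in rows.items()}
for method, average in averages.items():
    print(method, "-" if average is None else average)
```

The two complete rows reproduce the tabulated averages of 45.5 and 72.3, while the incomplete row yields no average, matching the "-" entry.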
| Method | AWA1 | AWA2 | CUB | SUN | aP & Y | Average |
| --- | --- | --- | --- | --- | --- | --- |
| DAP [27] | 44.1 | 46.1 | 40.0 | 39.9 | 33.8 | 40.8 |
| IAP [27] | 35.9 | 35.9 | 24.0 | 19.4 | 36.6 | 30.4 |
| CONSE [37] | 45.6 | 44.5 | 34.3 | 38.8 | 26.9 | 38.0 |
| DEVISE [38] | 54.2 | 59.7 | 52.0 | 56.5 | 39.8 | 52.4 |
| CMT [39] | 39.5 | 37.9 | 34.6 | 33.9 | 28.0 | 34.8 |
| SSE [67] | 60.1 | 61.0 | 43.9 | 51.5 | 34.0 | 50.1 |
| SJE [41] | 65.6 | 61.9 | 53.9 | 53.7 | 32.9 | 53.6 |
| ESZSL [47] | 58.2 | 58.6 | 53.9 | 54.5 | 38.3 | 52.7 |
| LATEM [42] | 55.1 | 55.8 | 49.3 | 55.3 | 35.2 | 50.1 |
| ALE [58] | 59.9 | 62.5 | 54.9 | 58.1 | 39.7 | 55.0 |
| SYNC [49] | 54.0 | 46.6 | 55.6 | 56.3 | 23.9 | 47.3 |
| SAE [48] | 53.0 | 54.1 | 33.3 | 40.3 | 8.3 | 37.8 |
| DEM [70] | 68.4 | 67.1 | 51.7 | 61.9 | 35.0 | 56.8 |
| GFZSL [62] | 68.3 | 63.8 | 49.3 | 60.6 | 38.4 | 56.1 |
| CAVE [55] | 71.4 | 65.8 | 52.1 | 61.7 | - | - |
| PSR [81] | - | 63.8 | 56.0 | 61.4 | 38.4 | - |
| TVN [82] | - | 68.8 | 58.1 | 60.7 | - | - |
| GAZSL [54] | - | 68.4 | 55.8 | 61.3 | 41.1 | - |
| f-CLSWGAN [50] | - | 68.8 | 57.3 | 60.8 | 40.5 | - |
| GDAN [51] | - | 67.7 | 51.0 | 54.8 | 40.4 | - |
| LESAE [66] | 66.1 | 68.4 | 53.9 | 60.0 | 40.8 | 57.8 |
| LisGAN [68] | 70.6 | - | 58.8 | 61.7 | 43.1 | - |
| SRGAN [53] | 71.9 | - | 55.4 | 62.2 | - | - |
| DEARF [60] | 72.1 | 69.3 | 38.5 | 48.6 | - | - |
| BSR [52] | - | 68.4 | 57.7 | 61.2 | 41.3 | - |
| BSAE | 72.3 | 69.4 | 59.7 | 58.6 | 53.0 | 62.6 |
Grasshopper Sparrow 50.5% | Pacific Loon 88.2% | Rhinoceros Auklet 86.8% | Western Grebe 85.7% | Least Auklet 84.8% | market outdoor 82.6% | recycling plant outdoor 37.8% | van interior 44.7% | subway station platform 54.8% | lecture room 54.5% |
| Class name | $D^{c}_{\mathrm{intra}}$ (BSAE) | $D^{c}_{\mathrm{intra}}$ (TSTD) | $D^{c}_{\mathrm{inter}}$ (BSAE) | $D^{c}_{\mathrm{inter}}$ (TSTD) |
| --- | --- | --- | --- | --- |
| chimpanzee | 0.275 | 0.444 | 1.663 | 1.361 |
| giant+panda | 0.441 | 0.442 | 1.583 | 1.264 |
| leopard | 0.360 | 0.390 | 1.690 | 1.246 |
| persian+cat | 0.077 | 0.388 | 1.887 | 1.501 |
| pig | 0.199 | 0.430 | 1.868 | 1.399 |
| hippopotamus | 0.412 | 0.498 | 1.594 | 1.337 |
| humpback+whale | 0.318 | 0.350 | 1.280 | 1.257 |
| raccoon | 0.224 | 0.333 | 1.326 | 1.192 |
| rat | 0.215 | 0.280 | 1.280 | 1.234 |
| seal | 0.257 | 0.228 | 1.310 | 1.243 |