
General learning algorithms trained on a specific dataset often have difficulty generalizing effectively across different domains. In traditional pattern recognition, a classifier is typically trained on one dataset and then tested on another under the assumption that both datasets follow the same distribution. This assumption, however, is hard to satisfy in real-world scenarios. The challenge of generalizing robustly from data originating from diverse sources is known as the domain adaptation problem. Many studies have proposed solutions that map samples from the two domains into a shared feature space and align their distributions. To achieve distribution alignment, minimizing the maximum mean discrepancy (MMD) between the feature distributions of the two domains has proven effective. However, aligning the overall feature distributions of the two domains ignores the essential class-wise alignment, which is crucial for adaptation. To address this issue, this study introduced a discriminative, class-wise, deep kernel-based MMD technique for unsupervised domain adaptation. Experimental findings demonstrated that the proposed approach not only aligns the data distribution of each class in both the source and target domains but also enhances the adaptation outcomes.
Citation: Hsiau-Wen Lin, Yihjia Tsai, Hwei Jen Lin, Chen-Hsiang Yu, Meng-Hsing Liu. Unsupervised domain adaptation with deep network based on discriminative class-wise MMD[J]. AIMS Mathematics, 2024, 9(3): 6628-6647. doi: 10.3934/math.2024323
Deep learning techniques have proven successful in various computer vision fields, such as image classification [1], object detection [2], and semantic segmentation [3]. However, the effectiveness of deep learning relies heavily on large, labeled training datasets, which could be labor-intensive to annotate. When dealing with large unlabeled datasets, it is often impractical to label enough data for training a deep learning model. An alternative approach is transfer learning, where labeled data from related domains (source domain) are utilized to enhance the model's performance in the domain of interest (target domain). Transfer learning is the process of applying knowledge learned from a labeled source domain to a target domain, where labeled data may be limited or unavailable.
Pan [4] classified transfer learning into three categories according to labeled data in the two domains used during training. They are (1) inductive transfer learning: When the target domain data is labeled, irrespective of whether the source domain data is labeled or not; (2) transductive transfer learning: When only the source domain data is labeled, while the target domain data remains unlabeled; and (3) unsupervised transfer learning: When both domains lack labels. Transductive transfer learning can be further divided into two types: (1) domain adaptation: When both domains use the same attributes but have different marginal probability distributions; and (2) sample selection bias: When the sample spaces or data types of the two domains are different, such as images in the source domain and text in the target domain. This paper focuses on unsupervised domain adaptation (UDA), which aims to minimize the distribution discrepancy between data from two domains, enabling successful knowledge transfer from the source domain to the target domain.
Currently, numerous domain adaptation methods have been researched and developed [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. These methods fall into three main categories [4]: Instance reweighting methods [5,6,7], feature extraction methods [8,9,10,11], and classifier adaptive approaches [12,13]. Feature extraction methods aim to learn domain-invariant feature representations and are broadly categorized into two types [14]: Adversarial learning-based approaches [15,16,17] and statistics-based approaches [18,19,20]. Adversarial learning-based methods seek to achieve domain-invariant features by generating images or feature representations from different domains. For instance, the deep reconstruction classification network (DRCN) [21] establishes a classifier for labeled source domain data and constructs a domain-invariant feature representation shared with unlabeled target domain data. Statistical methods involve defining a suitable measure of difference or distance between two distinct distributions [18,24,25,26,27,28,29]. Various distance metrics, such as quadratic [30], Kullback-Leibler [31] and Mahalanobis [32], have been proposed over the years. However, these methods are not easily adaptable to different domain adaptation (DA) models and may not effectively describe complex distributions like conditional and joint distributions due to theoretical limitations. In recent years, the MMD [28], initially used for two-sample testing, has been found to be effective in calculating the distance between sample distributions from two domains in feature space. It facilitates alignment between the distributions by minimizing the MMD between them. The method presented in this paper falls under this category. Long et al. [9] introduced the regularization of MMD, utilizing it to reduce the distribution difference between the feature distributions of two domains in hidden layers of deep adaptation networks.
The use of MMD focuses mainly on aligning the overall distributions of the two domains, but it often falls short of ensuring precise alignment of data within the same category across domains. In response, Long et al. [19] proposed the class-wise maximum mean discrepancy (CWMMD) for robust domain adaptation: the samples of the two domains are linearly mapped into a common feature space, the MMD is computed for each category, and the results are summed to obtain the CWMMD. Wang et al. [33] highlighted that minimizing the MMD is equivalent to minimizing the overall data variance while simultaneously maximizing the intra-class distances of the source and target domains, which decreases feature discriminativeness. They introduced a balance parameter to mitigate this issue, but their formulation was restricted to linear transformations of the feature space and used the L2 norm as the MMD estimator in that linearly transformed space. It is worth noting that the L2 norm is not well suited for general estimation [34,35], and that linear transformations may not adequately capture complex data relationships, especially when nonlinear mappings are required.
In contrast, deep neural networks, particularly convolutional neural networks (CNNs), can learn powerful and expressive nonlinear transformations. This paper proposes a method that exploits this capability by training a CNN architecture so that the model automatically learns feature representations well suited to the task at hand. Furthermore, the loss function used in the domain adaptation process can be evaluated efficiently in a reproducing kernel Hilbert space (RKHS). This facilitates effective alignment of data belonging to the same class from both the source and target domains in the shared feature space.
This section presents the related research, including pseudo labels and different variants of MMD.
Computing a class-level MMD during training requires pseudo-labels for the unlabeled target domain data. A simple way to generate pseudo labels is to apply formula (1) directly to the source domain model [14]; that is, the target sample $x_t$ is fed into the source domain model $f = C \circ F$, which comprises a feature extractor $F$ and a classifier $C$, to obtain the softmax classification output $\delta = (\delta_1, \delta_2, \ldots, \delta_C)$, and the index of the maximum component of $\delta$ is taken as the pseudo label of $x_t$. However, due to domain shift, pseudo-labels generated this way may be heavily biased. Instead, this study adopts the self-supervised pseudo-label strategy proposed by Liang et al. [36]. The strategy first uses the current target domain model to compute the centroid of each category of the target domain data, similar to weighted K-means clustering, as shown in formula (2). These centroids robustly represent the distributions of the different classes in the target domain. Next, each target sample is assigned the category of its nearest centroid as its pseudo label, as shown in formula (3), where $D_{\cos}(a, b)$ denotes the cosine distance between $a$ and $b$. The new pseudo-labels are then used to recompute the centroids, as shown in formula (4), and the pseudo-labels are updated once more, as shown in formula (5). Finally, with the new pseudo labels, the target model is self-supervised using the cross-entropy loss function in (6).
$\hat{y}_t = \operatorname*{arg\,max}_{1 \le k \le C} \delta_k(f(x_t)),$ | (1) |
$C_k^{(0)} = \dfrac{\sum_{x_t \in X_t} \delta_k(f(x_t))\, F(x_t)}{\sum_{x_t \in X_t} \delta_k(f(x_t))}, \quad k = 1, 2, \ldots, C,$ | (2) |
$\hat{y}_t = \operatorname*{arg\,min}_{1 \le k \le C} D_{\cos}\!\left(F(x_t),\, C_k^{(0)}\right),$ | (3) |
$C_k^{(1)} = \dfrac{\sum_{x_t \in X_t} \mathbf{1}(\hat{y}_t = k)\, F(x_t)}{\sum_{x_t \in X_t} \mathbf{1}(\hat{y}_t = k)},$ | (4) |
$\hat{y}_t = \operatorname*{arg\,min}_{1 \le k \le C} D_{\cos}\!\left(F(x_t),\, C_k^{(1)}\right),$ | (5) |
$L_{ssl}^{T}(f_t; X_t, \hat{Y}_t) = -\mathbb{E}_{(x_t, \hat{y}_t) \in X_t \times \hat{Y}_t} \sum_{k=1}^{K} \mathbf{1}[k = \hat{y}_t] \log \delta_k(f(x_t)).$ | (6) |
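To make the centroid-based pseudo-labeling concrete, the following is a minimal NumPy sketch of formulas (2)–(5); the array names (`probs` for the softmax outputs $\delta(f(x_t))$, `feats` for the extractor outputs $F(x_t)$) and the helper functions are illustrative assumptions rather than the authors' implementation.

```python
# A minimal NumPy sketch of the self-supervised pseudo-label strategy of Liang et al. [36]
# (formulas (2)-(5)). `probs` (softmax outputs, shape [n_t, C]) and `feats`
# (extractor outputs F(x_t), shape [n_t, d]) are hypothetical inputs.
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Pairwise cosine distance D_cos between rows of a ([n, d]) and rows of b ([C, d])."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return 1.0 - a_n @ b_n.T                       # shape [n, C]

def self_supervised_pseudo_labels(probs, feats, eps=1e-8):
    # Formula (2): soft, prediction-weighted class centroids C_k^(0).
    centroids = (probs.T @ feats) / (probs.sum(axis=0, keepdims=True).T + eps)
    # Formula (3): nearest-centroid assignment in cosine distance.
    labels = cosine_distance(feats, centroids).argmin(axis=1)
    # Formula (4): recompute hard centroids C_k^(1) from the current assignment.
    one_hot = np.eye(probs.shape[1])[labels]       # [n_t, C]
    centroids = (one_hot.T @ feats) / (one_hot.sum(axis=0, keepdims=True).T + eps)
    # Formula (5): final pseudo labels from the refreshed centroids.
    return cosine_distance(feats, centroids).argmin(axis=1)

# Toy usage: 6 target samples, 3 classes, 4-dimensional features.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=6)
feats = rng.normal(size=(6, 4))
print(self_supervised_pseudo_labels(probs, feats))
```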
The MMD is a distance measure between feature means. Gretton et al. [28] introduced an MMD measure that embeds distribution metrics in an RKHS and used it to conduct a two-sample test for detecting differences between two unknown distributions, $p$ and $q$. The purpose of this test is to draw two sets of samples $X$ and $Y$ from these distributions and determine whether $p$ and $q$ are different. They applied the kernel-based MMD to two-sample tests on various problems and achieved excellent performance. Furthermore, kernel-based MMDs have been shown to be consistent, asymptotically normal, and robust to model misspecification, and they have been successfully applied to various problems, including transfer learning [29], kernel Bayesian inference [37], approximate Bayesian computation [38], two-sample testing [28], goodness-of-fit testing [39], generative moment matching networks (GMMN) [40], and autoencoders [41].
Gaussian kernel-based MMDs are commonly used estimators with the key property of universality, which allows the estimator to converge to the best approximation of the (unknown) data-generating distribution within the model, without making any assumptions about this distribution. In contrast, the L2 norm lacks these properties, suffers from the curse of dimensionality, and is not suitable for universal estimation [34,35]. Furthermore, Gaussian kernel-based MMDs also serve as an effective measure of domain difference in UDA scenarios, and their computation is streamlined by applying a kernel function directly to the samples.
The squared MMD in an RKHS, denoted $\|\mu_p - \mu_q\|_{\mathcal{H}}^2$, can be expressed straightforwardly using kernel functions, and an unbiased estimate for finite samples is easily derived. Consider independent random variables $x$ and $x'$ drawn from distribution $p$, and independent random variables $y$ and $y'$ drawn from distribution $q$, and let $\mathcal{H}$ be a universal RKHS with unit ball $F$. The squared MMD is given in (7) [32], where $\phi(\cdot)$ is the function mapping the samples into $\mathcal{H}$, $\mu_p = \mathbb{E}_{x \sim p}[\phi(x)]$ and $\mu_q = \mathbb{E}_{y \sim q}[\phi(y)]$ are the kernel mean embeddings, and $k$ is set to the commonly used Gaussian kernel in (8). An unbiased empirical estimate is given in (9), where $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ are two sets randomly sampled from the probability distributions $p$ and $q$, respectively. However, there is no definitive method for selecting the bandwidth $\sigma$ of the kernel $k$ in (9). Gretton et al. [28] suggested using the median distance between samples as the bandwidth, but did not verify that this choice is optimal.
$(\mathrm{MMD}(F, p, q))^2 = \|\mu_p - \mu_q\|_{\mathcal{H}}^2 = \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle_{\mathcal{H}} = \langle \mu_p, \mu_p \rangle_{\mathcal{H}} + \langle \mu_q, \mu_q \rangle_{\mathcal{H}} - 2\langle \mu_p, \mu_q \rangle_{\mathcal{H}}$
$= \mathbb{E}_{x, x' \sim p}[\langle \phi(x), \phi(x') \rangle_{\mathcal{H}}] + \mathbb{E}_{y, y' \sim q}[\langle \phi(y), \phi(y') \rangle_{\mathcal{H}}] - 2\mathbb{E}_{x \sim p,\, y \sim q}[\langle \phi(x), \phi(y) \rangle_{\mathcal{H}}]$
$= \mathbb{E}_{x, x' \sim p}[k(x, x')] + \mathbb{E}_{y, y' \sim q}[k(y, y')] - 2\mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)],$ | (7) |
$k(a, b) = \exp\!\left(-\dfrac{\|a - b\|_2^2}{2\sigma^2}\right),$ | (8) |
$(\mathrm{MMD}_u(F, X, Y))^2 = \dfrac{1}{m(m-1)}\sum_{i \ne j}^{m} k(x_i, x_j) + \dfrac{1}{n(n-1)}\sum_{i \ne j}^{n} k(y_i, y_j) - \dfrac{2}{mn}\sum_{i, j}^{m, n} k(x_i, y_j).$ | (9) |
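As a concrete illustration of formulas (8) and (9), the following NumPy sketch computes the unbiased Gaussian-kernel MMD estimate together with the median-distance bandwidth heuristic mentioned above; the function names and the toy data are assumptions for illustration only.

```python
# A minimal NumPy sketch of the unbiased Gaussian-kernel MMD estimator in (9),
# with the median-distance heuristic for the bandwidth sigma.
# X ([m, d]) and Y ([n, d]) are hypothetical sample matrices drawn from p and q.
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def gaussian_kernel(A, B, sigma):
    return np.exp(-pairwise_sq_dists(A, B) / (2.0 * sigma ** 2))   # formula (8)

def median_bandwidth(X, Y):
    Z = np.vstack([X, Y])
    d = np.sqrt(pairwise_sq_dists(Z, Z))
    return np.median(d[np.triu_indices_from(d, k=1)])              # median heuristic

def mmd2_unbiased(X, Y, sigma):
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Formula (9): drop the diagonal terms of Kxx and Kyy for an unbiased estimate.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(64, 5))
Y = rng.normal(0.5, 1.0, size=(80, 5))
print(mmd2_unbiased(X, Y, median_bandwidth(X, Y)))   # > 0 on average when p and q differ
```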
Although the MMD has been commonly used in cross-domain problems, minimizing the MMD between the samples from two domains only narrows their marginal distributions. Long et al. [8] proposed joint distribution adaptation (JDA), which jointly adapts the marginal and conditional distributions in a reduced-dimensional principal component space and constructs new feature representations. They adopted a principal component analysis (PCA) transformation and minimized the Euclidean distance between the sample means of the two domains in the reduced-dimensional principal component space. They referred to this quantity as the MMD for the marginal distribution, as shown in formula (10), where $X_s \in \mathbb{R}^{d \times n_s}$ and $X_t \in \mathbb{R}^{d \times n_t}$ are the samples from the source and target domains, respectively. Here, $n_s$ and $n_t$ are the numbers of samples from the source and target domains, $d$ is the sample dimension, and $\phi$ represents a linear transformation function. Let $A$ be the $d \times K$ standard matrix of the linear transformation $\phi$. Formula (10) can then be rewritten as formula (11), where $X_{st} = [X_s \,|\, X_t]$ and $M_0 \in \mathbb{R}^{n_{st} \times n_{st}}$ is computed as shown in formula (12).
$\mathrm{MMD}^2 = \left\| \dfrac{1}{n_s}\sum_{x_i \in X_s} \phi(x_i) - \dfrac{1}{n_t}\sum_{x_j \in X_t} \phi(x_j) \right\|_2^2,$ | (10) |
$\mathrm{MMD}^2 = \left\| \dfrac{1}{n_s}\sum_{x_i \in X_s} A^{T} x_i - \dfrac{1}{n_t}\sum_{x_j \in X_t} A^{T} x_j \right\|_2^2 = \mathrm{tr}\!\left(A^{T} X_{st} M_0 X_{st}^{T} A\right),$ | (11) |
$(M_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & x_i, x_j \in X_s \\ \dfrac{1}{n_t n_t}, & x_i, x_j \in X_t \\ -\dfrac{1}{n_s n_t}, & \text{otherwise}. \end{cases}$ | (12) |
In addition to the MMD for the marginal distribution, they also proposed an MMD for the conditional distribution. During empirical estimation, however, obtaining samples for each category requires labels for the target domain samples, which are not available. To address this, they suggested using pseudo labels for the target samples, obtained either from the current classifier trained on the source domain samples or through other methods. They referred to the MMD for the conditional distribution as the class-wise MMD (CWMMD), as shown in Eq (13). Here, $X_s^c$ and $X_t^c$ represent the data samples of the $c$-th category from the source and target domains, respectively, $n_s^c$ and $n_t^c$ are the respective sample sizes, and the calculation of $M_c \in \mathbb{R}^{n_{st} \times n_{st}}$ is detailed in (14).
$\mathrm{CWMMD}^2 = \sum_{c=1}^{C}\left\| \dfrac{1}{n_s^c}\sum_{x_i \in X_s^c} A^{T} x_i - \dfrac{1}{n_t^c}\sum_{x_j \in X_t^c} A^{T} x_j \right\|_2^2 = \sum_{c=1}^{C} \mathrm{tr}\!\left(A^{T} X_{st} M_c X_{st}^{T} A\right),$ | (13) |
$(M_c)_{ij} = \begin{cases} \dfrac{1}{n_s^c n_s^c}, & x_i, x_j \in X_s^c \\ \dfrac{1}{n_t^c n_t^c}, & x_i, x_j \in X_t^c \\ -\dfrac{1}{n_s^c n_t^c}, & x_i \in X_s^c, x_j \in X_t^c \ \text{or}\ x_j \in X_s^c, x_i \in X_t^c \\ 0, & \text{otherwise}. \end{cases}$ | (14) |
In JDA, the marginal distribution discrepancy and the conditional distribution discrepancy across domains are minimized simultaneously. Consequently, the optimization problem of JDA is obtained by combining Eqs (11) and (13), as shown in Eq (15). In (15), the constraint $A^{T} X_{st} H_{st} X_{st}^{T} A = I_{K \times K}$ fixes the overall data variation, ensuring that the data information in the subspace is statistically retained to some extent, $\|A\|_F^2$ controls the size of the matrix $A$, and $\alpha$ is a regularization parameter ensuring a well-defined optimization problem. Here, $H_{st} = I_{n_{st} \times n_{st}} - \frac{1}{n_{st}}\mathbf{1}_{n_{st} \times n_{st}}$ is the centering matrix, where $n_{st} = n_s + n_t$ and $\mathbf{1}_{n_{st} \times n_{st}}$ is the $n_{st} \times n_{st}$ matrix with all elements equal to one.
$\min_{A} \sum_{c=0}^{C} \mathrm{tr}\!\left(A^{T} X_{st} M_c X_{st}^{T} A\right) + \alpha \|A\|_F^2 \quad \text{s.t.} \quad A^{T} X_{st} H_{st} X_{st}^{T} A = I_{K \times K}.$ | (15) |
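For readers who want to see how the JDA quantities are assembled, the following NumPy sketch builds the MMD matrices $M_0$ and $M_c$ of formulas (12) and (14) and evaluates the trace objective of (15) without the regularizer; the variable names, the mask-based class selection, and the assumption that every class appears in both domains are illustrative choices, not the authors' code.

```python
# A NumPy sketch of the JDA building blocks in (11)-(15): the MMD matrices M_0 and
# M_c, and the trace form tr(A^T X_st M X_st^T A). Samples are stored as columns
# (X_s: d x n_s, X_t: d x n_t) and A (d x K) is any candidate linear transformation.
import numpy as np

def mmd_matrix(n_s, n_t, mask_s=None, mask_t=None):
    """M_0 when the masks cover all samples; M_c when they select class c (formulas (12), (14))."""
    e = np.zeros((n_s + n_t, 1))
    ms = np.ones(n_s, bool) if mask_s is None else mask_s   # source columns belonging to the class
    mt = np.ones(n_t, bool) if mask_t is None else mask_t   # target columns belonging to the class
    e[:n_s][ms] = 1.0 / ms.sum()                             # assumes the class is present in both domains
    e[n_s:][mt] = -1.0 / mt.sum()
    return e @ e.T                                           # rank-one MMD matrix

def jda_objective(Xs, Xt, ys, yt_pseudo, A, classes):
    """Sum over c = 0..C of tr(A^T X_st M_c X_st^T A), i.e., (15) without the regularizer."""
    Xst = np.hstack([Xs, Xt])                                # d x (n_s + n_t)
    total = np.trace(A.T @ Xst @ mmd_matrix(Xs.shape[1], Xt.shape[1]) @ Xst.T @ A)
    for c in classes:
        Mc = mmd_matrix(Xs.shape[1], Xt.shape[1], ys == c, yt_pseudo == c)
        total += np.trace(A.T @ Xst @ Mc @ Xst.T @ A)
    return total
```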
Wang et al. [33] proposed an insight into the working principle of MMD and theoretically revealed its close agreement with human transfer behavior. In Figure 1 [33], consider a pair of classes both labeled "desktop computer", one from the source domain and one from the target domain. Minimizing the MMD between these two distributions involves two key transformations: (1) the two relatively small red circles (the hollow and mesh circles) are enlarged, i.e., their respective intra-class distances are maximized; and (2) the two circles gradually move closer along their respective arrows, i.e., their joint variance is minimized. This process is analogous to how humans abstract common features to encompass all possible appearances, although the detailed information is heavily decayed. Wang et al. also demonstrated this insight theoretically.
Let $(S(A,X))^c_{\text{inter}} = \mathrm{tr}(A^{T} X_{st} M_c X_{st}^{T} A)$ denote the inter-class distance (i.e., the square of the MMD) between the $c$-th class data of the source domain and of the target domain in the space transformed by the matrix $A$, and let $S_{\text{inter}} = \sum_{c=1}^{C}(S(A,X))^c_{\text{inter}}$; then (15) can be written as (16). Wang et al. derived $S_{\text{inter}} = S_{\text{var}} - S_{\text{intra}}$, so (16) can be written as (17), where $S_{\text{intra}}$ is the intra-class distance and $S_{\text{var}}$ is the variance of the entire data. Therefore, minimizing the inter-class distance $S_{\text{inter}}$ is equivalent to minimizing the variance $S_{\text{var}}$ while simultaneously maximizing the intra-class distance $S_{\text{intra}}$, which reduces feature discriminativeness. To address this, a trade-off parameter is introduced to adjust the hidden intra-class distance in $S_{\text{inter}}$, as shown in Eq (18). They obtained an optimal linear transformation matrix $A$ by minimizing the loss evaluated in this transformed space.
$\min_{A}\left[S_{\text{inter}} + \mathrm{MMD}^2 + \alpha \|A\|_F^2\right] \quad \text{s.t.} \quad A^{T} X_{st} H_{st} X_{st}^{T} A = I_{K \times K},$ | (16) |
$\min_{A}\left[S_{\text{var}} - S_{\text{intra}} + \mathrm{MMD}^2 + \alpha \|A\|_F^2\right] \quad \text{s.t.} \quad A^{T} X_{st} H_{st} X_{st}^{T} A = I_{K \times K},$ | (17) |
$\min_{A}\left[S_{\text{var}} + \beta \cdot S_{\text{intra}} + \mathrm{MMD}^2 + \alpha \|A\|_F^2\right] \quad \text{s.t.} \quad A^{T} X_{st} H_{st} X_{st}^{T} A = I_{K \times K}.$ | (18) |
The unsupervised domain adaptation training proposed in this paper uses a discriminative CWMMD (DCWMMD) to align data of the same class between the source and target domains. By minimizing the mean discrepancy between the two domains while alleviating the tendency of MMD minimization to reduce feature discriminativeness, the proposed method effectively achieves the goal of unsupervised domain adaptation.
Unlike Wang et al. [33], who used the L2 norm as the MMD estimator in a linearly transformed feature space, this study employs a deep network to learn the feature space. Samples in this space are then mapped into an RKHS, where the loss function can be evaluated and minimized efficiently. The Gaussian kernel is commonly used because the RKHS induced by the Gaussian kernel is guaranteed to be universal [30]. Wang et al. showed that, under the MMD they defined, the inter-class distance equals the variance minus the intra-class distance. This section reformulates the inter-class distance, intra-class distance, and variance defined by Wang et al. and adopts the MMD with the Gaussian kernel. Moreover, it provides a proof that, when this MMD is used to measure the distance between the sample distributions of the two domains, the inter-class distance is indeed equal to the variance minus the intra-class distance.
This study considers only two domains, one source domain and one target domain. $X_s^c$ and $X_t^c$ denote the sample sets of class $c$ from the source domain and the target domain, respectively, and $X^c$ (or $X_{st}^c$) denotes the union of the class-$c$ sample sets of both domains, i.e., $X^c = X_{st}^c = X_s^c \cup X_t^c$. Further symbols and notation are listed in the nomenclature provided in Table 1.
symbol | meaning |
$X_s^c$ | set of samples of class $c$ from the source domain |
$X_t^c$ | set of samples of class $c$ from the target domain |
$X^c (= X_{st}^c)$ | $X^c = X_s^c \cup X_t^c$, set of samples of class $c$ from the source and target domains |
$X_s$ | $X_s = \bigcup_{c=1}^{C} X_s^c$, set of samples from the source domain |
$X_t$ | $X_t = \bigcup_{c=1}^{C} X_t^c$, set of samples from the target domain |
$X (= X_{st})$ | $X = X_s \cup X_t$, set of samples from the source and target domains |
$n_s^c$ | $\|X_s^c\|$, number of samples in $X_s^c$ |
$n_t^c$ | $\|X_t^c\|$, number of samples in $X_t^c$ |
$n^c (= n_{st}^c)$ | $\|X^c\| = n_s^c + n_t^c$, number of samples in $X^c$ |
$n_s$ | $\|X_s\|$, number of samples in $X_s$ |
$n_t$ | $\|X_t\|$, number of samples in $X_t$ |
$n (= n_{st})$ | $\|X\| = \sum_{c=1}^{C} n^c = n_s + n_t$, number of samples in $X$ |
$m_s^c$ | $(1/n_s^c)\sum_{x_i \in X_s^c} x_i$, mean of $X_s^c$ |
$m_t^c$ | $(1/n_t^c)\sum_{x_i \in X_t^c} x_i$, mean of $X_t^c$ |
$m^c (= m_{st}^c)$ | $(1/n^c)\sum_{x_i \in X^c} x_i$, mean of $X^c$ |
$m_s$ | $(1/n_s)\sum_{x_i \in X_s} x_i$, mean of $X_s$ |
$m_t$ | $(1/n_t)\sum_{x_i \in X_t} x_i$, mean of $X_t$ |
$m (= m_{st})$ | $(1/n)\sum_{x_i \in X} x_i$, mean of $X$ |
This subsection uses the RKHS-based MMD and leverages the kernel trick to efficiently compute the inter-class distance, intra-class distance, and variance of the samples from the two domains. Moreover, it demonstrates that, under the Gaussian kernel-based MMD, the inter-class distance decomposes into the joint variance minus the intra-class distance.
Definition 3.1. Interclass distance.
The square of the inter-class distance between the samples from the source domain and the target domain is defined as $S_{\text{inter}} = \sum_{c=1}^{C}(S)^c_{\text{inter}}$, where $(S)^c_{\text{inter}} = (S_{st})^c_{\text{inter}}$ is the square of the inter-class distance (or MMD) between the samples of class $c$ from the source domain and the target domain, as shown in (19), which can be rewritten in the forms (20) and (21).
$(S)^c_{\text{inter}} = k(m_s^c - m_t^c,\, m_s^c - m_t^c),$ | (19) |
$(S)^c_{\text{inter}} = \left[\dfrac{n_s^c + n_t^c}{n_t^c}\, k(m_s^c - m_{st}^c,\, m_s^c - m_{st}^c) + \dfrac{n_s^c + n_t^c}{n_s^c}\, k(m_t^c - m_{st}^c,\, m_t^c - m_{st}^c)\right],$ | (20) |
$(S)^c_{\text{inter}} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\left(n_s^c\, k(m_s^c - m_{st}^c,\, m_s^c - m_{st}^c) + n_t^c\, k(m_t^c - m_{st}^c,\, m_t^c - m_{st}^c)\right).$ | (21) |
Definition 3.2. Intraclass distance.
The square of the intra-class distance between the samples from the source domain and the target domain is defined as Sintra=∑Cc=1(S)cintra, where (S)cintra=(Sst)cintra is the square of the intraclass distance of the samples of class c from the two domains, as defined in (22).
$(S)^c_{\text{intra}} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\left(\sum_{x_i \in X_s^c} k(x_i - m_s^c,\, x_i - m_s^c) + \sum_{x_j \in X_t^c} k(x_j - m_t^c,\, x_j - m_t^c)\right).$ | (22) |
Definition 3.3. Variance.
The joint variance of the samples from the source domain and the target domain is defined as Svar=∑Cc=1(S)cvar, where (S)cvar=(Sst)cvar is the joint variance of the samples of class c from the source domain and the target domain, as shown in (23).
$(S)^c_{\text{var}} = (S_{st})^c_{\text{var}} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\sum_{x_i \in X_{st}^c} k(x_i - m_{st}^c,\, x_i - m_{st}^c).$ | (23) |
The square of the MMD between the samples $X_s$ and $X_t$ from the source domain and the target domain, respectively, is defined in (24) and can be rewritten as (25). Conceptually, this is equivalent to treating the samples of both domains as belonging to a single class and computing the inter-class distance, which can be expressed as $(S)^{0}_{\text{inter}} = (S_{st})^{0}_{\text{inter}} = (\mathrm{MMD}(X_s, X_t))^2$, or as $(\mathrm{MMD}_u(X_s, X_t))^2$ when unbiased estimation is employed, as given in formulas (26) and (27), respectively.
$(\mathrm{MMD}(X_s, X_t))^2 = k(m_s - m_t,\, m_s - m_t) = \dfrac{n_s + n_t}{n_t}\, k(m_s - m_{st},\, m_s - m_{st}) + \dfrac{n_s + n_t}{n_s}\, k(m_t - m_{st},\, m_t - m_{st}),$ | (24) |
$(\mathrm{MMD}(X_s, X_t))^2 = \dfrac{n_s + n_t}{n_s n_t}\left(n_s\, k(m_s - m_{st},\, m_s - m_{st}) + n_t\, k(m_t - m_{st},\, m_t - m_{st})\right),$ | (25) |
$(\mathrm{MMD}(X_s, X_t))^2 = \dfrac{1}{(n_s)^2}\sum_{x_i \in X_s}\sum_{x_j \in X_s} k(x_i, x_j) + \dfrac{1}{(n_t)^2}\sum_{x_i \in X_t}\sum_{x_j \in X_t} k(x_i, x_j) - \dfrac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t} k(x_i, x_j),$ | (26) |
$(\mathrm{MMD}_u(X_s, X_t))^2 = \dfrac{1}{n_s(n_s-1)}\sum_{\substack{x_i, x_j \in X_s \\ x_i \ne x_j}} k(x_i, x_j) + \dfrac{1}{n_t(n_t-1)}\sum_{\substack{x_i, x_j \in X_t \\ x_i \ne x_j}} k(x_i, x_j) - \dfrac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t} k(x_i, x_j).$ | (27) |
Theorem 3.1. The square of the interclass distance equals the data variance minus the square of the intra-class distance; that is, Sinter=Svar−Sintra.
Proof. Given the expressions for $(S)^c_{\text{inter}}$, $(S)^c_{\text{var}}$, and $(S)^c_{\text{intra}}$ in (21)–(23), it suffices to prove that $(S)^c_{\text{inter}} + (S)^c_{\text{intra}} = (S)^c_{\text{var}}$, i.e., that
$n_{sd}^c\, k(m_{sd}^c - m^c,\, m_{sd}^c - m^c) + \sum_{x_i \in X_{sd}^c} k(x_i - m_{sd}^c,\, x_i - m_{sd}^c) = \sum_{x_i \in X_{sd}^c} k(x_i - m^c,\, x_i - m^c)$
for $1 \le c \le C$ and $sd \in \{s, t\}$. Since
$n_{sd}^c\, k(m_{sd}^c - m^c,\, m_{sd}^c - m^c) = \sum_{x_i \in X_{sd}^c} k(m_{sd}^c - m^c,\, m_{sd}^c - m^c) = \sum_{x_i \in X_{sd}^c}\left(k(m_{sd}^c, m_{sd}^c) + k(m^c, m^c) - 2k(m_{sd}^c, m^c)\right) = \sum_{x_i \in X_{sd}^c}\left(k(m_{sd}^c, m_{sd}^c) + k(m^c, m^c) - 2k(x_i, m^c)\right)$
and
$\sum_{x_i \in X_{sd}^c} k(x_i - m_{sd}^c,\, x_i - m_{sd}^c) = \sum_{x_i \in X_{sd}^c}\left(k(x_i, x_i) + k(m_{sd}^c, m_{sd}^c) - 2k(x_i, m_{sd}^c)\right) = \sum_{x_i \in X_{sd}^c}\left(k(x_i, x_i) + k(m_{sd}^c, m_{sd}^c) - 2k(m_{sd}^c, m_{sd}^c)\right) = \sum_{x_i \in X_{sd}^c}\left(k(x_i, x_i) - k(m_{sd}^c, m_{sd}^c)\right),$
we have
$n_{sd}^c\, k(m_{sd}^c - m^c,\, m_{sd}^c - m^c) + \sum_{x_i \in X_{sd}^c} k(x_i - m_{sd}^c,\, x_i - m_{sd}^c) = \sum_{x_i \in X_{sd}^c}\left(k(m_{sd}^c, m_{sd}^c) + k(m^c, m^c) - 2k(x_i, m^c) + k(x_i, x_i) - k(m_{sd}^c, m_{sd}^c)\right) = \sum_{x_i \in X_{sd}^c} k(x_i - m^c,\, x_i - m^c).$
This completes the proof.
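The following small numerical check illustrates Theorem 3.1 under the simplifying assumption of an explicit finite-dimensional feature map, so that the RKHS inner products reduce to ordinary dot products on already-mapped samples; the toy data and variable names are assumptions.

```python
# A numerical check of Theorem 3.1: S_inter = S_var - S_intra for one class c.
# We assume the samples have already been mapped into a finite-dimensional feature
# space, so <u, v>_H is a plain dot product.
import numpy as np

rng = np.random.default_rng(2)
Zs = rng.normal(0.0, 1.0, size=(12, 6))     # class-c source samples, already mapped
Zt = rng.normal(0.8, 1.2, size=(9, 6))      # class-c target samples, already mapped
ns, nt = len(Zs), len(Zt)
coef = (ns + nt) / (ns * nt)

m_s, m_t = Zs.mean(axis=0), Zt.mean(axis=0)
m_st = np.vstack([Zs, Zt]).mean(axis=0)

# Definition 3.1, form (21): weighted distances of the domain means to the joint mean.
S_inter = coef * (ns * np.dot(m_s - m_st, m_s - m_st) + nt * np.dot(m_t - m_st, m_t - m_st))
# Definition 3.2, formula (22): spread of each domain around its own mean.
S_intra = coef * (((Zs - m_s) ** 2).sum() + ((Zt - m_t) ** 2).sum())
# Definition 3.3, formula (23): spread of the pooled samples around the joint mean.
S_var = coef * ((np.vstack([Zs, Zt]) - m_st) ** 2).sum()

print(np.isclose(S_inter, S_var - S_intra))   # True: S_inter = S_var - S_intra
```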
Theorem 3.1 indicates that minimizing the interclass distance is equivalent to minimizing their variation, while simultaneously maximizing the intra-class distance, thus reducing feature discriminativeness. To address this, the strategy proposed by Wang et al. [33] was adopted with a trade-off parameter β (−1≤β≤1) introduced to adjust the hidden intra-class distance within Sinter, resulting in the formulation of the discriminative class-level loss function, denoted as Ldcwmmd in formula (28), and its expansion is given in formula (29).
$L_{\text{dcwmmd}} = S_{\text{var}} + \beta \cdot S_{\text{intra}} + \mathrm{MMD}^2(X_s, X_t) = \sum_{c=1}^{C}(S_{st})^c_{\text{var}} + \beta \sum_{c=1}^{C}(S_{st})^c_{\text{intra}} + (S_{st})^{0}_{\text{inter}},$ | (28) |
$L_{\text{dcwmmd}} = \sum_{c=1}^{C}\dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\left(\sum_{x_j \in X_{st}^c}\langle x_j - m_{st}^c,\, x_j - m_{st}^c\rangle_{\mathcal{H}} + \beta\sum_{x_i \in X_s^c}\langle x_i - m_s^c,\, x_i - m_s^c\rangle_{\mathcal{H}} + \beta\sum_{x_j \in X_t^c}\langle x_j - m_t^c,\, x_j - m_t^c\rangle_{\mathcal{H}}\right) + \dfrac{n_s + n_t}{n_t}\langle m_s - m_{st},\, m_s - m_{st}\rangle_{\mathcal{H}} + \dfrac{n_s + n_t}{n_s}\langle m_t - m_{st},\, m_t - m_{st}\rangle_{\mathcal{H}}.$ | (29) |
The terms $(S_{st})^c_{\text{inter}}$, $(S_{st})^c_{\text{intra}}$, and $(S_{st})^c_{\text{var}}$ defined in Definitions 3.1 to 3.3 can be expressed in terms of the individual samples $x_j$ using formulas (30) to (32). To ensure unbiased estimation and to compute the loss in the learned feature space, this study uses the loss function $L_{\text{udcwmmd}}$, written in terms of the samples $x_j$ in (33) and in terms of their feature representations $z_j$ in (34). By setting $\alpha_1 = \frac{(\beta+1)(n_s^c+n_t^c)}{n_s^c n_t^c}$, $\alpha_2 = -\frac{n_s^c+(n_s^c+n_t^c)\beta}{(n_s^c)^2 n_t^c}$, $\alpha_3 = -\frac{n_t^c+(n_s^c+n_t^c)\beta}{n_s^c (n_t^c)^2}$, $\alpha_4 = -\frac{2}{n_s^c n_t^c}$, $\gamma_1 = \frac{1}{n_s(n_s-1)}$, $\gamma_2 = \frac{1}{n_t(n_t-1)}$, and $\gamma_3 = -\frac{2}{n_s n_t}$, the simplified form (34) of $L_{\text{udcwmmd}}$ is obtained. During training, these scalar coefficients can be precomputed and stored, eliminating the need for subsequent recalculation.
$(S_{st})^c_{\text{inter}} = \dfrac{1}{(n_s^c)^2}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} + \dfrac{1}{(n_t^c)^2}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s^c n_t^c}\sum_{x_i \in X_s^c,\, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}},$ | (30) |
$(S_{st})^c_{\text{intra}} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\left[\sum_{x_i \in X_{st}^c}\langle x_i, x_i\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_t^c}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}}\right],$ | (31) |
$(S_{st})^c_{\text{var}} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\sum_{x_j \in X_{st}^c}\langle x_j, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c n_t^c}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c n_t^c}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s^c n_t^c}\sum_{x_i \in X_s^c,\, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}},$ | (32) |
$L_{\text{udcwmmd}}(X_s, X_t) = \sum_{c=1}^{C}\left[\dfrac{(\beta+1)(n_s^c + n_t^c)}{n_s^c n_t^c}\sum_{x_j \in X_{st}^c}\langle x_j, x_j\rangle_{\mathcal{H}} - \dfrac{n_s^c + (n_s^c + n_t^c)\beta}{(n_s^c)^2 n_t^c}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{n_t^c + (n_s^c + n_t^c)\beta}{n_s^c (n_t^c)^2}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s^c n_t^c}\sum_{x_i \in X_s^c,\, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}}\right] + \dfrac{1}{n_s(n_s-1)}\sum_{\substack{x_i, x_j \in X_s \\ x_i \ne x_j}}\langle x_i, x_j\rangle_{\mathcal{H}} + \dfrac{1}{n_t(n_t-1)}\sum_{\substack{x_i, x_j \in X_t \\ x_i \ne x_j}}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t}\langle x_i, x_j\rangle_{\mathcal{H}},$ | (33) |
$L_{\text{udcwmmd}}(Z_s, Z_t) = \sum_{c=1}^{C}\left[\alpha_1\sum_{z_j \in Z_{st}^c}\langle z_j, z_j\rangle_{\mathcal{H}} + \alpha_2\sum_{z_i, z_j \in Z_s^c}\langle z_i, z_j\rangle_{\mathcal{H}} + \alpha_3\sum_{z_i, z_j \in Z_t^c}\langle z_i, z_j\rangle_{\mathcal{H}} + \alpha_4\sum_{z_i \in Z_s^c,\, z_j \in Z_t^c}\langle z_i, z_j\rangle_{\mathcal{H}}\right] + \gamma_1\sum_{\substack{z_i, z_j \in Z_s \\ z_i \ne z_j}}\langle z_i, z_j\rangle_{\mathcal{H}} + \gamma_2\sum_{\substack{z_i, z_j \in Z_t \\ z_i \ne z_j}}\langle z_i, z_j\rangle_{\mathcal{H}} + \gamma_3\sum_{z_i \in Z_s,\, z_j \in Z_t}\langle z_i, z_j\rangle_{\mathcal{H}}.$ | (34) |
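As an illustration of how (34) can be evaluated with the kernel trick, the following NumPy sketch computes $L_{\text{udcwmmd}}$ from mini-batch features, with the RKHS inner products $\langle z_i, z_j\rangle_{\mathcal{H}}$ replaced by Gaussian-kernel evaluations $k(z_i, z_j)$; the function names, the skipping of classes absent from a mini-batch, and the batch layout are assumptions, not the authors' implementation.

```python
# A NumPy sketch of the discriminative class-wise loss L_udcwmmd in (34). RKHS inner
# products <z_i, z_j>_H are evaluated with the Gaussian kernel; Zs/Zt are feature
# batches ([n_s, d], [n_t, d]), ys the source labels, yt the target pseudo labels.
import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def udcwmmd_loss(Zs, Zt, ys, yt, sigma, beta=0.0):
    ns, nt = len(Zs), len(Zt)
    Kss = gaussian_kernel(Zs, Zs, sigma)
    Ktt = gaussian_kernel(Zt, Zt, sigma)
    Kst = gaussian_kernel(Zs, Zt, sigma)
    # Global unbiased MMD terms with coefficients gamma_1, gamma_2, gamma_3.
    loss = ((Kss.sum() - np.trace(Kss)) / (ns * (ns - 1))
            + (Ktt.sum() - np.trace(Ktt)) / (nt * (nt - 1))
            - 2.0 * Kst.sum() / (ns * nt))
    for c in np.unique(ys):
        s, t = (ys == c), (yt == c)
        ncs, nct = int(s.sum()), int(t.sum())
        if ncs == 0 or nct == 0:               # skip classes missing from either mini-batch
            continue
        a1 = (beta + 1.0) * (ncs + nct) / (ncs * nct)
        a2 = -(ncs + (ncs + nct) * beta) / (ncs ** 2 * nct)
        a3 = -(nct + (ncs + nct) * beta) / (ncs * nct ** 2)
        a4 = -2.0 / (ncs * nct)
        diag_sum = np.trace(Kss[np.ix_(s, s)]) + np.trace(Ktt[np.ix_(t, t)])  # sum of <z_j, z_j>_H over Z^c_st
        loss += (a1 * diag_sum
                 + a2 * Kss[np.ix_(s, s)].sum()
                 + a3 * Ktt[np.ix_(t, t)].sum()
                 + a4 * Kst[np.ix_(s, t)].sum())
    return loss
```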
A categorical cross-entropy, $L_{\text{cls}}$, is commonly used as the error of the classifier on the source domain data, as shown in formula (35). Here, $\hat{l}^c_{si}$ is the $c$-th element of $\hat{l}_{si} = C(z_{si})$ and $y^c_{si}$ is the $c$-th element of the ground-truth one-hot label vector $y_{si}$, where $y^c_{si} = 1$ if the label of the original sample $x_{si}$ corresponding to $z_{si}$ is $c$, and $y^c_{si} = 0$ otherwise. To encourage the samples to form dense, uniform, and well-separated clusters, the label-smoothing (LS) technique [42] is applied to the cross-entropy loss. This involves substituting the smoothed label $(1-\alpha)y^c_{si} + \alpha/C$, a weighted average of $y^c_{si}$ and $1/C$, for $y^c_{si}$ in the categorical cross-entropy, yielding the smoothed categorical cross-entropy $L^{ls}_{\text{cls}}(Z_s, Y_s)$ shown in (36). Here, $\alpha$ is a smoothing factor, generally set to 0.1 for better performance, and $C$ is the number of classes. The goal of LS is to prevent the model from becoming overconfident in its predictions and to reduce overfitting. Müller et al. [42] showed that LS encourages the representations of training examples from the same class to group into tight clusters.
During training with target samples, the entropy of predicted results for those samples is minimized, as illustrated in formula (37). This strategy is employed because it has been indicated that unlabeled examples are especially beneficial when class overlap is small [43]. Minimizing this entropy encourages the predicted results to be more inclined toward a specific category, making the feature distribution between categories in the target domain more distinct and explicit. The training loss function of the entire network is defined as LDCWDA, as shown in formula (38), where ω1 and ω2 are weight parameters.
$L_{\text{cls}}(Z_s, Y_s) = -\dfrac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} y_{si}^c \log \hat{l}_{si}^c,$ | (35) |
$L_{\text{cls}}^{ls}(Z_s, Y_s) = -\dfrac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C}\left((1-\alpha)\, y_{si}^c + \alpha / C\right)\log \hat{l}_{si}^c,$ | (36) |
$L_{\text{ent}}(Z_t) = -\dfrac{1}{n_t}\sum_{j=1}^{n_t}\sum_{c=1}^{C} \hat{l}_{tj}^c \log \hat{l}_{tj}^c,$ | (37) |
$L_{\text{DCWDA}} = L_{\text{udcwmmd}} + \omega_1 L_{\text{cls}}^{ls} + \omega_2 L_{\text{ent}}.$ | (38) |
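A minimal sketch of the remaining loss terms (36)–(38) is given below; the softmax inputs, the default weights, and the function names are illustrative assumptions.

```python
# A NumPy sketch of the classification and entropy terms in (36)-(38). `probs_s`
# and `probs_t` are hypothetical softmax outputs of the classifier C on source and
# target features, `ys` the integer source labels; alpha, w1, w2 follow the paper's
# symbols but their default values here are illustrative.
import numpy as np

def smoothed_cross_entropy(probs_s, ys, alpha=0.1, eps=1e-12):
    """Formula (36): cross-entropy against labels smoothed as (1 - alpha)*y + alpha/C."""
    n, C = probs_s.shape
    smooth = np.full((n, C), alpha / C)
    smooth[np.arange(n), ys] += 1.0 - alpha
    return -(smooth * np.log(probs_s + eps)).sum(axis=1).mean()

def prediction_entropy(probs_t, eps=1e-12):
    """Formula (37): mean entropy of the target predictions."""
    return -(probs_t * np.log(probs_t + eps)).sum(axis=1).mean()

def total_loss(l_udcwmmd, probs_s, ys, probs_t, w1=1.0, w2=0.1):
    """Formula (38): L_DCWDA = L_udcwmmd + w1 * L^ls_cls + w2 * L_ent."""
    return l_udcwmmd + w1 * smoothed_cross_entropy(probs_s, ys) + w2 * prediction_entropy(probs_t)
```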
The training architecture of the proposed discriminative class-wise domain adaptation (DCWDA) system, as shown in Figure 2, consists of a feature extractor (F) used for extracting domain-invariant features and a classifier (C). The feature extractor F and the classifier C are duplicated to represent the data paths of the source domain and the target domain, and a dotted line is drawn in the middle to indicate shared weights. During training, the source domain samples xs and target domain samples xt are first separately input into the feature extractor F, which outputs features zs=F(xs) and zt=F(xt). The discriminative class-wise loss function Ludcwmmd is then computed for zs and zt. Subsequently, zs and zt are separately input into the classifier C, producing classification results ˆls=C(zs) and ˆlt=C(zt). This allows the calculation of the cross-entropy Lcls for the predicted result for the source domain sample xs and the entropy Lent for the predicted result of the target domain sample xt. The training algorithm is shown in Algorithm 1, where the batch sizes of both the source sample and the target sample are set to N.
Algorithm 1. Training the DCWDA model. |
Input: Δs, Δt, α, ω1, ω2, η2;
Initialize parameters θF and θC;
# train the model parameters θF and θC on Δs and Δt
repeat until convergence
    (Xs, Ys) = {(xs1, ys1), (xs2, ys2), …, (xsN, ysN)} ← mini-batch from Δs;
    Xt = {xt1, xt2, …, xtN} ← mini-batch from Δt;
    Zs ← F(Xs); Zt ← F(Xt);
    # generate pseudo labels:
    L̂t = {l̂t1, l̂t2, …, l̂tN} ← C(F(Xt)); # classify the target samples
    Yt = {yt1, yt2, …, ytN} = {M(l̂t1), M(l̂t2), …, M(l̂tN)}; # obtain pseudo labels, where M((v1, v2, …, vC)) = argmax_{1 ≤ c ≤ C} v_c
    # evaluate losses:
    Ludcwmmd(Zs, Zt) ← …; # using (34)
    Llscls ← −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} ((1 − α) y^c_{si} + α/C) log l̂^c_{si}; # using (36)
    Lent ← −(1/N) Σ_{j=1}^{N} Σ_{c=1}^{C} l̂^c_{tj} log l̂^c_{tj}; # using (37)
    LDCWDA ← Ludcwmmd + ω1 Llscls + ω2 Lent; # using (38)
    # update θF and θC to minimize LDCWDA:
    θF ← θF − η2 ∇_{θF} LDCWDA;
    θC ← θC − η2 ∇_{θC} LDCWDA;
end repeat
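The following PyTorch sketch shows how one training step of Algorithm 1 might be wired together; the networks F and C, the optimizer, and the helper `udcwmmd_loss_torch` (a hypothetical torch counterpart of the NumPy loss sketched earlier) are placeholders, and the arg-max pseudo-labels follow Algorithm 1 as written, whereas the paper refreshes the pseudo-labels of all target training data at the start of each epoch with the self-supervised centroid strategy described earlier.

```python
# A PyTorch sketch of one training step of Algorithm 1. F (feature extractor), C
# (classifier), the optimizer, and udcwmmd_loss_torch are hypothetical placeholders;
# only the overall flow follows the algorithm.
import torch
import torch.nn.functional as Fn

def train_step(F, C, xs, ys, xt, optimizer, sigma, beta, w1, w2, alpha=0.1):
    zs, zt = F(xs), F(xt)                          # shared-weight feature extractor
    logits_s, logits_t = C(zs), C(zt)
    with torch.no_grad():                          # pseudo labels for the target mini-batch
        yt = logits_t.argmax(dim=1)
    l_dcw = udcwmmd_loss_torch(zs, zt, ys, yt, sigma, beta)          # formula (34)
    l_cls = Fn.cross_entropy(logits_s, ys, label_smoothing=alpha)    # formula (36)
    p_t = Fn.softmax(logits_t, dim=1)
    l_ent = -(p_t * torch.log(p_t + 1e-12)).sum(dim=1).mean()        # formula (37)
    loss = l_dcw + w1 * l_cls + w2 * l_ent                           # formula (38)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```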
The proposed method was evaluated on digit datasets and on office object data. The digit datasets used in the experiments are the Modified National Institute of Standards and Technology database (MNIST) [44], the U.S. Postal Service dataset (USPS) [45], and Street View House Numbers (SVHN) [46]. MNIST and USPS are handwritten-digit datasets. MNIST has 60,000 training samples and 10,000 testing samples, all grayscale images of size 28×28. USPS consists of 9,298 grayscale images of size 16×16. SVHN contains 73,257 training images and 26,032 test images, which are 32×32 color images cropped from street-view photographs of house numbers. In each image, the digit to be recognized is a single digit of a house number located in the center of the image, surrounded by other digits or distracting objects. In the experiments, all images are scaled to 32×32 pixels. Figure 3 displays some images from MNIST, USPS, and SVHN, where the numbers within blue frames in the SVHN images are the digits to be recognized. The Office-31 dataset [47] comprises three domains: Amazon (A), DSLR (digital single-lens reflex) (D), and Webcam (W). Each domain contains the same 31 object categories from an office environment, totaling 4,110 images, with varying numbers of images per category. Figure 4 displays some images from Webcam, DSLR, and Amazon.
In the training process, the batch sizes for the digit dataset and Office-31 dataset are set to 128 and 64, respectively. Resnet-18 and Resnet-50 [1] are adopted as the network architectures for the feature extractors on the digit dataset and the Office-31 dataset, respectively. Both architectures undergo fine-tuning with pre-trained ImageNet network parameters. In addition, the pseudo-labels of all target domain training data are updated with the current classifier parameters at the beginning of each epoch.
The accuracies of various combinations of source domain and target domain were evaluated. The combinations for the digit datasets are: MNIST to USPS (M → U), USPS to MNIST (U → M), and SVHN to MNIST (S → M). The combinations for the Office-31 dataset are: Amazon to DSLR (A → D), Amazon to Webcam (A → W), DSLR to Amazon (D → A), DSLR to Webcam (D → W), Webcam to Amazon (W → A), and Webcam to DSLR (W → D). Table 2 compares the proposed method with various unsupervised domain adaptation methods on the digit datasets, including adversarial discriminative domain adaptation (ADDA) [17], adversarial dropout regularization (ADR) [48], conditional adversarial domain adaptation (CDAN) [49], cycle-consistent adversarial domain adaptation (CyCADA) [50], sliced Wasserstein discrepancy (SWD) [51], and source hypothesis transfer (SHOT) [36]. Table 3 compares the proposed method with various unsupervised domain adaptation methods on the Office-31 dataset, including Wang et al. [33], deep adaptation networks (DAN) [18], the domain-adversarial neural network (DANN) [16], ADDA [17], multi-adversarial domain adaptation (MADA) [52], SHOT [36], the collaborative and adversarial network (CAN) [14], and mini-batch dynamic geometric embedding (MDGE) [23]. Each reported accuracy is the average of three test runs. The best-performing method for each source-to-target combination is highlighted in bold. The "source-only" row indicates that the classifier is trained directly on the source domain data without domain adaptation and then tested on the target domain data. The "target-supervised" row indicates that the classifier is trained directly on the target domain data and tested on the target data. Typically, the "source-only" and "target-supervised" accuracies serve as the lower and upper bounds for domain adaptation accuracy, although there is no guarantee that the adapted accuracy will fall within this range.
As can be seen from Table 2, the proposed method outperforms the other methods on the digit dataset pairs other than S → M and achieves the highest average accuracy. It is worth noting that SVHN images exhibit strong color variations and noise, while USPS is a smaller dataset with smaller images than the other digit datasets. Hence, test results for combinations of USPS and SVHN would not be very informative, and for this reason the pairs M → S, S → U, and U → S are not included in the digit dataset experiments. As can be seen from Table 3, the proposed method outperforms the other methods on three of the six Office-31 combinations and achieves the highest average accuracy.
method | M → U | U → M | S → M | Average |
Source-only | 69.6 | 82.2 | 67.1 | 73.0 |
ADDA [17] | 90.1 | 89.4 | 76.0 | 85.2 |
ADR [48] | 93.1 | 93.2 | 95.0 | 93.8 |
CDAN [49] | 98.0 | 95.6 | 89.2 | 94.3 |
CyCADA [50] | 96.5 | 95.6 | 90.4 | 94.2 |
SWD [51] | 97.1 | 98.1 | 98.9 | 98.0 |
SHOT [36] | 97.8 | 97.6 | 99.0 | 98.1 |
ours | 98.0 | 98.2 | 98.8 | 98.3 |
target-supervised | 99.4 | 98.1 | 99.4 | 98.9 |
method | A → D | A → W | D → A | D → W | W → A | W → D | Average |
Source-only | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 76.1 |
Wang et al. [33] | 90.76 | 88.93 | 75.43 | 98.49 | 75.15 | 99.80 | 88.06 |
DAN [18] | 78.6 | 80.5 | 63.6 | 97.1 | 62.8 | 99.6 | 80.4 |
DANN[16] | 79.7 | 82.0 | 68.2 | 96.9 | 67.4 | 99.1 | 82.2 |
ADDA [17] | 77.8 | 86.2 | 69.5 | 96.2 | 68.9 | 98.4 | 82.9 |
MADA [52] | 87.8 | 90.0 | 70.3 | 97.4 | 66.4 | 99.6 | 85.2 |
SHOT [36] | 93.9 | 90.1 | 75.3 | 98.7 | 75.0 | 99.9 | 88.8 |
CAN [14] | 95.0 | 94.5 | 78.0 | 99.1 | 77.0 | 99.8 | 90.6 |
MDGE [23] | 90.6 | 89.4 | 69.5 | 98.9 | 68.4 | 99.8 | 86.1 |
ours | 96.3 | 94.9 | 77.9 | 99.5 | 76.5 | 99.6 | 90.8 |
target-supervised | 98.0 | 98.7 | 86.0 | 98.7 | 86.0 | 98.0 | 94.3 |
In this paper, we tackled the domain adaptation problem by using a deep network architecture with a DCWMMD as the loss function. The MMD used is based on embedding distribution metrics in the reproducing kernel Hilbert space, which not only leverages the kernel trick to enhance computational efficiency but also conforms to the original MMD definition. The marginal MMD aligns the overall data distributions regardless of class alignment. To alleviate this limitation, the CWMMD was introduced to align the data distributions of the same class from the two domains. However, this adjustment may reduce feature discriminativeness. By decomposing the CWMMD into the variance minus the intra-class distance, an adjustable weight parameter for the intra-class distance term was introduced, providing the flexibility to preserve feature discriminability. The experimental results show that our proposed method improves upon the approach proposed by Wang et al. [33]. In terms of the error function, we not only applied the LS technique to the cross-entropy for training on the source domain, but also added the entropy of the predicted labels of the target samples to enhance the overall training performance. The proposed architecture was evaluated on two benchmarks, the digit datasets and the Office-31 dataset. The results demonstrate competitive accuracy rates for domain adaptation when compared with other methods.
In the future, we will continue to improve the performance of the training process, for example by applying data augmentation to increase the diversity of the data and by using high-confidence target domain data to provide pseudo-labels for supervised post-processing training. Last but not least, we also plan to apply our work to other domain adaptation tasks, such as face recognition, object recognition, and image-to-image translation.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the National Science and Technology Council, Taiwan, R.O.C. under the grant NSTC 112-2221-E-032-041.
The authors declare no conflict of interest.
[1] | K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, In: Proceedings of conference on computer vision and pattern recognition (CVPR), 2016,770–778. https://doi.org/10.1109/CVPR.2016.90 |
[2] | S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Machine Intell., 39 (2017), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031 |
[3] | K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, In: 2017 IEEE International conference on computer vision (ICCV), 2017, 2980–2988. https://doi.org/10.1109/ICCV.2017.322 |
[4] | S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng., 22 (2010), 1345–1359. https://doi.org/10.1109/TKDE.2009.191 |
[5] | J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, B. Schö lkopf, Correcting sample selection bias by unlabeled data, In: Advances in neural information processing systems, The MIT Press, 2007. https://doi.org/10.7551/mitpress/7503.003.0080 |
[6] | S. Li, S. Song, G. Huang, Prediction reweighting for domain adaptation, IEEE Trans. Neural Netw. Learn. Syst., 28 (2017), 1682–169. https://doi.org/10.1109/TNNLS.2016.2538282 |
[7] | M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, M. Salzmann, Domain adaptation on the statistical manifold, In: 2014 IEEE conference on computer vision and pattern recognition, 2014, 2481–2488. https://doi.org/10.1109/CVPR.2014.318 |
[8] | M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer feature learning with joint distribution adaptation, In: 2013 IEEE international conference on computer vision, 2013, 2200–2207. https://doi.org/10.1109/ICCV.2013.274 |
[9] | M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer joint matching for unsupervised domain adaptation, In: 2014 IEEE conference on computer vision and pattern recognition, 2014, 1410–1417. https://doi.org/10.1109/CVPR.2014.183 |
[10] | M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, M. Salzmann, Unsupervised domain adaptation by domain invariant projection, In: 2013 IEEE international conference on computer vision, 2013,769–776. https://doi.org/10.1109/ICCV.2013.100 |
[11] | S. J. Pan, J. T. Kwok, Q. Yang, Transfer learning via dimensionality reduction, In: Proceedings of the AAAI conference on artificial intelligence, 23 (2008), 677–682. |
[12] | M. Long, J. Wang, G. Ding, S. J. Pan, P. S. Yu, Adaptation regularization: A general framework for transfer learning, IEEE Trans. Knowl. Data Eng., 26 (2014), 1076–1089. https://doi.org/10.1109/TKDE.2013.111 |
[13] | L. Bruzzone, M. Marconcini, Domain adaptation problems: A DASVM classification technique and a circular validation strategy, IEEE Trans. Pattern Anal. Machine Intell., 32 (2010), 770–787. https://doi.org/10.1109/TPAMI.2009.57 |
[14] | W. Zhang, W. Ouyang, W. Li, D. Xu, Collaborative and adversarial network for unsupervised domain adaptation, In: 2018 IEEE/CVF conference on computer vision and pattern recognition, 2018. https://doi.org/10.1109/CVPR.2018.00400 |
[15] | K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, D. Krishnan, Unsupervised pixel-level domain adaptation with generative adversarial networks, In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), 2017, 95–104. https://doi.org/10.1109/CVPR.2017.18 |
[16] | Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, et al., Domain adversarial training of neural networks, J. Machine Learn. Res., 17 (2016), 1–35. |
[17] | E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, Adversarial discriminative domain adaptation, In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), 2017, 2962–2971. https://doi.org/10.1109/CVPR.2017.316 |
[18] | M. Long, Y. Cao, J. Wang, M. I. Jordan, Learning transferable features with deep adaptation networks, In: Proceedings of the 32nd international conference on international conference on machine learning, 37 (2015), 97–105. |
[19] | M. Long, H. Zhu, J. Wang, M. I. Jordan, Unsupervised domain adaptation with residual transfer networks, In: Proceedings of the 30th international conference on neural information processing systems, 2016, 136–144. https://dl.acm.org/doi/10.5555/3157096.3157112 |
[20] | B. Sun and K. Saenko, Deep coral: Correlation alignment for deep domain adaptation, In: European conference on computer vision, 2016,443–450. https://doi.org/10.1007/978-3-319-49409-8_35 |
[21] | M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, W. Li, Deep reconstruction-classification networks for unsupervised domain adaptation, In: European conference on computer vision, 2016,597–613. https://doi.org/10.1007/978-3-319-46493-0_36 |
[22] | S. Khan, M. Asim, S. Khan, A. Musyafa, Q. Wu, Unsupervised domain adaptation using fuzzy rules and stochastic hierarchical convolutional neural networks, Comput. Elect. Eng., 105 (2023), 108547. https://doi.org/10.1016/j.compeleceng.2022.108547 |
[23] | S. Khan, Y. Guo, Y. Ye, C. Li, Q. Wu, Mini-batch dynamic geometric embedding for unsupervised domain adaptation, Neural Process. Lett., 55 (2023), 2063–2080. https://doi.org/10.1007/s11063-023-11167-7 |
[24] | L. Zhang, W. Zuo, D. Zhang, LSDT: Latent sparse domain transfer learning for visual adaptation, IEEE Trans. Image Process., 25 (2016), 1177–1191. https://doi.org/10.1109/TIP.2016.2516952 |
[25] | Y. Chen, W. Li, C. Sakaridis, D. Dai, L. V. Gool, Domain adaptive faster R-CNN for object detection in the wild, In: 2018 IEEE/CVF conference on computer vision and pattern recognition, 2018, 3339–3348. https://doi.org/10.1109/CVPR.2018.00352 |
[26] | K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, D. Krishnan, Unsupervised pixel-level domain adaptation with generative adversarial networks, In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), 2017, 95–104. https://doi.org/10.1109/CVPR.2017.18 |
[27] | H. Xu, J. Zheng, A. Alavi, R. Chellappa, Cross-domain visual recognition via domain adaptive dictionary learning, arXiv: 1804.04687, 2018. https://doi.org/10.48550/arXiv.1804.04687 |
[28] | A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, A. Smola, A kernel two-sample test, J. Machine Learn. Res., 13 (2012), 723–773. https://doi.org/10.5555/2188385.2188410 |
[29] | S. J. Pan, I. W. Tsang, J. T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw., 22 (2011), 199–210. https://doi.org/10.1109/TNN.2010.2091281 |
[30] | K. M. Borgwardt, A. Gretton, M. J. Rasch, H. P. Kriegel, B. Scholkopf, A. J. Smola, Integrating structured biological data by kernel maximum mean discrepancy, Bioinformatics, 22 (2006), e49–e57. https://doi.org/10.1093/bioinformatics/btl242 |
[31] | S. Si, D. Tao, B. Geng, Bregman divergence-based regularization for transfer subspace learning, IEEE Trans. Knowl. Data Eng., 22 (2010), 929–942. https://doi.org/10.1109/TKDE.2009.126 |
[32] | J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. Wortman, Learning bounds for domain adaptation, In: Advances in neural information processing systems, 20 (2007), 129–136. |
[33] | W. Wang, H. Li, Z. Ding, Z. Wang, Rethink maximum mean discrepancy for domain adaptation, arXiv: 2007.00689, 2020. https://doi.org/10.48550/arXiv.2007.00689 |
[34] | L. Devroye, G. Lugosi, Combinatorial methods in density estimation, In: Combinatorial methods in density estimation, New York: Springer, 2001. https://doi.org/10.1007/978-1-4613-0125-7 |
[35] | Y. Baraud, L. Birgé, Rho-estimators revisited: General theory and applications, Ann. Statist., 46 (2018), 3767–3804. https://doi.org/10.1214/17-AOS1675 |
[36] | J. Liang, D. Hu, J. Feng, Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation, In: Proceedings of the 37th international conference on machine learning, 119 (2020), 6028–6039. |
[37] | L. Song, A. Gretton, D. Bickson, Y. Low, C. Guestrin, Kernel belief propagation, In: Proceedings of the 14th international conference on artificial intelligence and statistics, 15 (2011), 707–715. |
[38] | M. Park, W. Jitkrittum, D. Sejdinovic, K2-ABC: Approximate bayesian computation with kernel embeddings, In: Proceedings of the 19th international conference on artificial intelligence and statistics, 51 (2015), 398–407. |
[39] | W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, A. Gretton, A linear-time kernel goodness-of-fit test, In: Advances in neural information processing systems, 2017, 262–271. |
[40] | Y. Li, K. Swersky, R. S. Zemel, Generative moment matching networks, arXiv:1502.02761, 2015. https://doi.org/10.48550/arXiv.1502.02761 |
[41] | S. Zhao, J. Song, S. Ermon, InfoVAE: Information maximizing variational autoencoders, arXiv:1706.02262, 2018. https://doi.org/10.48550/arXiv.1706.02262 |
[42] | R. Müller, S. Kornblith, G. Hinton, When does label smoothing help? In: 33rd Conference on neural information processing systems, 2019. |
[43] | Y. Grandvalet, Y. Bengio, Semi-supervised learning by entropy minimization, In: Advances in neural information processing systems, 17 (2004), 529–536. |
[44] | Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, 86 (1998), 2278–2324. https://doi.org/10.1109/5.726791 |
[45] | J. J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Machine Intell., 16 (1994), 550–55. https://doi.org/10.1109/34.291440 |
[46] | Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Ng, Reading digits in natural images with unsupervised feature learning, Proc. Int. Conf. Neural Inf. Process. Syst. Workshops, 2011. |
[47] | K. Saenko, B. Kulis, M. Fritz, T. Darrell, Adapting visual category models to new domains, In: Lecture notes in computer science, Berlin: Springer, 6314 (2010), 213–226. https://doi.org/10.1007/978-3-642-15561-1_16 |
[48] | K. Saito, Y. Ushiku, T. Harada, K. Saenko, Adversarial dropout regularization, arXiv:1711.01575, 2018. https://doi.org/10.48550/arXiv.1711.01575 |
[49] | M. Long, Z. Cao, J. Wang, M. I. Jordan, Conditional adversarial domain adaptation, In: 32nd Conference on neural information processing systems, 2018, 1647–1657. |
[50] | J. Hoffman, E. Tzeng, T. Park, J. Y. Zhu, P. Isola, K. Saenko, et al., Cycada: Cycle-consistent adversarial domain adaptation, In: Proceedings of the 35th international conference on machine learning, 2018, 1989–1998. |
[51] | C. Y. Lee, T. Batra, M. H. Baig, D. Ulbricht, Sliced wasserstein discrepancy for unsupervised domain adaptation, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019, 10285–10295. |
[52] | Z. Pei, Z. Cao, M. Long, J. Wang, Multi-adversarial domain adaptation, In: Thirty-second AAAI conference on artificial intelligence, 32 (2018). https://doi.org/10.1609/aaai.v32i1.11767 |