Complementary label learning based on knowledge distillation

Peng Ying; Zhongnian Li; Renke Sun; Xinzheng Xu

doi:10.3934/mbe.2023796

Mathematical Biosciences and Engineering

2023, Volume 20, Issue 10: 17905-17918. doi: 10.3934/mbe.2023796

Research article

School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

Academic Editor: Jorge Bernardino

Received: 08 July 2023 Revised: 02 September 2023 Accepted: 12 September 2023 Published: 19 September 2023

Abstract

Complementary label learning (CLL) is a type of weakly supervised learning method that utilizes the category of samples that do not belong to a certain class to learn their true category. However, current CLL methods mainly rely on rewriting classification losses without fully leveraging the supervisory information in complementary labels. Therefore, enhancing the supervised information in complementary labels is a promising approach to improve the performance of CLL. In this paper, we propose a novel framework called Complementary Label Enhancement based on Knowledge Distillation (KDCL) to address the lack of attention given to complementary labels. KDCL consists of two deep neural networks: a teacher model and a student model. The teacher model focuses on softening complementary labels to enrich the supervision information in them, while the student model learns from the complementary labels that have been softened by the teacher model. Both the teacher and student models are trained on the dataset that contains only complementary labels. To evaluate the effectiveness of KDCL, we conducted experiments on four datasets, namely MNIST, F-MNIST, K-MNIST and CIFAR-10, using two sets of teacher-student models (Lenet-5+MLP and DenseNet-121+ResNet-18) and three CLL algorithms (PC, FWD and SCL-NL). Our experimental results demonstrate that models optimized by KDCL outperform those trained only with complementary labels in terms of accuracy.

Citation: Peng Ying, Zhongnian Li, Renke Sun, Xinzheng Xu. Complementary label learning based on knowledge distillation[J]. Mathematical Biosciences and Engineering, 2023, 20(10): 17905-17918. doi: 10.3934/mbe.2023796

1. Introduction

Supervised learning is an important branch of machine learning. In supervised multi-classification problems, each sample is assigned a label which indicates the category it belongs to ^[1]. Supervised learning is effective when there are enough samples with high quality labels. However, it is expensive and time-consuming to build datasets with a multitude of accurate labels. To solve this problem, researchers have proposed a series of weakly supervised learning (WSL) methods, which aim to train models with partial, incomplete or inaccurate supervised information, such as noise-label learning ^[2,3,4,5], semi-supervised learning ^[6,7,8,9], partial-label learning ^[10,11,12], positive-confidence learning ^[13], unlabeled-unlabeled learning ^[14] and others.

In this paper, we consider another WSL framework called complementary label learning (CLL). We show the difference between complementary labels and true labels in Figure 1. Compared to an ordinary label, a complementary label indicates a class that the sample does not belong to. Such labels are easier and less costly to collect. For example, in some highly specialized domains, expert knowledge is very expensive. If complementary labels are used for annotation, we only need to determine the extent of the label space and then use common sense to identify a category that is wrong. It is much simpler and faster to determine which class a sample does not belong to than which class it belongs to. Besides, CLL can also protect data privacy in sensitive fields such as medical and financial records, because the true information of the data no longer needs to be disclosed. This not only protects data privacy and security, but also makes it easier to collect data in these areas.

Figure 1. Comparison of complementary labels (bottom) with real labels (top). A complementary label is one of the categories the image does not belong to.

The framework of CLL was first proposed by Ishida et al. ^[15]. They proved that the unbiased risk estimator (URE) computed only from complementary labels is equivalent to the ordinary classification risk when the loss function satisfies certain conditions. In URE, the loss function must be nonconvex and symmetric, which limits its applicability. To overcome this limitation, Yu et al. ^[16] made cross-entropy loss usable in CLL by constructing a complementary label transition matrix, and they also considered the case in which different labels have different probabilities of being selected as a complementary label. Then, Ishida et al. ^[17] extended URE and proposed a CLL framework adapted to more general loss functions. This framework still has an unbiased estimator of the regular classification risk, but it works for all loss functions. Chou et al. ^[18] optimized URE from the perspective of gradient estimation and proposed using a surrogate complementary loss (SCL) to obtain an unbiased risk estimate, which effectively alleviated the overfitting problem in URE. Liu et al. ^[19] applied common losses such as categorical cross entropy (CCE), mean square error (MSE) and mean absolute error (MAE) to CLL. Ishiguro et al. ^[20] studied the problem that complementary labels may be affected by label noise. To mitigate its adverse effects, they selected noise-robust losses that satisfy a weighted symmetric condition or a more relaxed condition. Recently, Zhang et al. ^[21] broadened the setting of complementary label datasets and discussed the case in which a dataset contains a large number of complementary labels and a small number of true labels at the same time. They proposed an adversarial complementary label learning network named Clarinet. Clarinet consists of two deep neural networks, one to classify complementary labels and true labels, and the other to learn from complementary labels.

Previous studies on CLL mainly focus on rewriting the classification risk under the ordinary label distribution as the risk under the complementary label distribution and on exploring the use of more loss functions ^[15,16,17,18,19]. These risk-rewriting techniques prove the consistency between the risk of complementary label classification and the risk of supervised classification, which enables the classifier to perform accurate classification using only complementary labels. However, in this process only complementary labels are involved in the risk calculation, and the information they contain is extremely limited, which results in consistently lower performance of CLL compared to supervised learning. Therefore, we aim to enhance the supervision information of the complementary labels to further improve the performance of CLL. In this paper, we propose a two-step complementary label enhancement framework based on knowledge distillation (KDCL). It consists of the following components: 1) a teacher model trained on the complementary label dataset to generate soft labels, which contain more supervision information in the form of a label distribution; 2) a student model trained on the same dataset to learn from both soft labels and complementary labels; 3) a final loss function that integrates the losses from soft labels and complementary labels and is used to update the parameters of the student model. We use three CLL loss functions to conduct experiments on several benchmark datasets, and compare the accuracy of the student model before and after enhancement by KDCL. The experimental results show that KDCL can effectively improve the performance of CLL.

2. Preliminaries

2.1. Learning from true labels

Suppose that the input sample is a $d$-dimensional vector $x\in {\mathbb{R}}^{d}$ with class label $y\in \{1, 2, \dots, K\}$, where $K$ is the number of classes in the dataset. We are given a training set $D = {\left\{\left({x}_{i}, {y}_{i}\right)\right\}}_{i = 1}^{N}$ with $N$ samples, all of which independently follow the same distribution $p(x, y)$. The goal of learning from true labels is to learn a mapping $f\left(x\right)$ from the sample space ${\mathbb{R}}^{d}$ to the label space $\{1, 2, \dots, K\}$; $f\left(x\right)$ is also called a classifier. We want $f\left(x\right)$ to minimize the multi-class classification risk:

$\begin{array}{c}R(f) = {\mathbb{E}}_{p\left(x, y\right) \sim D}\left[L\left(f\left(x\right), y\right)\right], \end{array}$

(1)

where $L\left(f\left(x\right), y\right)$ is multi-class loss function, $f\left(x\right)$ is usually obtained by the following equation:

$\begin{array}{c}f\left(x\right) = \underset{y\in \{1, 2, \dots , K\}}{\arg\max}\;{g}_{y}\left(x\right), \end{array}$

(2)

where $g\left(x\right):{\mathbb{R}}^{d}\to {\mathbb{R}}^{K}$ . In deep neural networks, $g\left(x\right)$ is the prediction distribution of the output from the last fully connected layer.

In general, distribution $p(x, y)$ is unknown. We can use the sample mean to approximate the classification risk in Eq (1). $R(f)$ is empirically estimated as $\widehat{R}(f)$ :

$\begin{array}{c}\widehat{R}(f) = \frac{1}{N}\sum\limits _{i = 1}^{N}L\left(f\left({x}_{i}\right), {y}_{i}\right), \end{array}$

(3)

where $N$ is the number of training samples and $i$ indexes the $i$-th sample.

2.2. Learning from complementary labels

In CLL, each sample $x$ is assigned only one complementary label $\bar{y}$. Therefore, the dataset changes from $D = {\left\{\left({x}_{i}, {y}_{i}\right)\right\}}_{i = 1}^{N}$ to $\bar{D} = {\left\{\left({x}_{i}, {\bar{y}}_{i}\right)\right\}}_{i = 1}^{N}$, where $\bar{y}\in \{1, 2, \dots, K\}\backslash \left\{y\right\}$ and $D\ne \bar{D}$. The samples in $\bar{D}$ independently follow an unknown distribution $\bar{p}(x, \bar{y})$. If all complementary labels are selected in an unbiased way, meaning that every class other than the true one has the same probability of being chosen, $\bar{p}(x, \bar{y})$ can be expressed as:

$\begin{array}{c}\bar{p}\left(x, \bar{y}\right) = \frac{1}{K-1}\sum\limits _{y\ne \bar{y}}p\left(x, y\right).\end{array}$

(4)
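The unbiased selection behind Eq (4) can be illustrated with a minimal sketch. It assumes 0-indexed classes (the text uses $1, \dots, K$), and the function names are hypothetical, not taken from the paper.

```python
import random

def sample_complementary_label(true_label, num_classes, rng):
    """Draw a complementary label uniformly from the K-1 classes other than the true one."""
    # Draw from {0, ..., K-2} and skip over the true label, so that each
    # wrong class is chosen with probability 1 / (K - 1).
    draw = rng.randrange(num_classes - 1)
    return draw if draw < true_label else draw + 1

def make_complementary_dataset(true_labels, num_classes, seed=0):
    """Replace every true label by one uniformly drawn complementary label."""
    rng = random.Random(seed)
    return [sample_complementary_label(y, num_classes, rng) for y in true_labels]

true_labels = [0, 3, 9, 3, 5]          # hypothetical ordinary labels
comp_labels = make_complementary_dataset(true_labels, num_classes=10)
print(list(zip(true_labels, comp_labels)))
```

Each pair printed at the end holds a true label and a complementary label that never coincide, which is the only property the rest of the section relies on.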

Supposing that $\bar{L}\left(f\left(x\right), \bar{y}\right)$ is complementary loss function, we can obtain similar multi-class risk as Eq (1) in distribution $\bar{p}\left(x, \bar{y}\right)$ :

$\begin{array}{c}\bar{R}(f) = {\mathbb{E}}_{\bar{p}\left(x, \bar{y}\right) \sim \bar{D}}\left[\bar{L}\left(f\left(x\right), \bar{y}\right)\right].\end{array}$

(5)

To our best knowledge, Ishida et al. ^[15] are the first to prove that the difference between Eq (1) and Eq (5) is constant when the loss function $\bar{L}$ satisfies certain conditions and this constant $M$ only depends on the number of categories $K$ :

$\begin{aligned}R(f) &= \left(K-1\right){\mathbb{E}}_{\bar{p}\left(x, \bar{y}\right) \sim \bar{D}}\left[\bar{L}\left(f\left(x\right), \bar{y}\right)\right]+M \\ &= \left(K-1\right)\bar{R}(f)+M.\end{aligned}$

(6)

All coefficients are constant when the loss function satisfies the condition. So it is possible to learn from complementary labels by minimizing $R(f)$ in Eq (6). Then, they rewrite one-versus-all (OVA) loss ${L}_{OVA}$ and pairwise-comparison (PC) loss ${L}_{PC}$ in ordinary multi-class classification as ${\bar{L}}_{OVA}$ and ${\bar{L}}_{PC}$ in CLL:

$\begin{aligned}{\bar{L}}_{OVA}\left(g\left(x\right), \bar{y}\right) &= \frac{1}{K-1}\sum\limits _{y\ne \bar{y}}l\left({g}_{y}\left(x\right)\right)+l\left({-g}_{\bar{y}}\left(x\right)\right),\\ {\bar{L}}_{PC}\left(g\left(x\right), \bar{y}\right) &= \sum\limits _{y\ne \bar{y}}l\left({g}_{y}\left(x\right)-{g}_{\bar{y}}\left(x\right)\right),\end{aligned}$

(7)

where $l\left(z\right):\mathbb{R}\to \mathbb{R}$ is a binary loss and it must be nonconvex and symmetric, such as sigmoid loss. $g\left(x\right)$ is the same as Eq (2) and ${g}_{y}\left(x\right)$ is the $y$ -th element of $g\left(x\right)$ . Finally, the unbiased risk estimator of $R(f)$ can be obtained by sample mean:

$\widehat{R}(f) = \frac{\left(K-1\right)}{N}\sum\limits _{n = 1}^{N}\bar{L}\left(f\left({x}_{n}\right), {\bar{y}}_{n}\right)+M.$

(8)
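As a rough illustration of Eqs (7) and (8), the sketch below evaluates the two complementary losses with the sigmoid loss $l(z) = 1/(1+e^{z})$, which satisfies $l(z)+l(-z) = 1$. The function names and the example scores are hypothetical, classes are 0-indexed, and the constant $M$ is left as a parameter because its value is not stated at this point in the text.

```python
import math

def sigmoid_loss(z):
    """Binary sigmoid loss l(z) = 1 / (1 + exp(z)); it satisfies l(z) + l(-z) = 1."""
    return 1.0 / (1.0 + math.exp(z))

def pc_loss(g, comp_label):
    """Complementary pairwise-comparison loss of Eq (7) for one sample."""
    return sum(sigmoid_loss(g[y] - g[comp_label])
               for y in range(len(g)) if y != comp_label)

def ova_loss(g, comp_label):
    """Complementary one-versus-all loss of Eq (7) for one sample."""
    k = len(g)
    others = sum(sigmoid_loss(g[y]) for y in range(k) if y != comp_label)
    return others / (k - 1) + sigmoid_loss(-g[comp_label])

def empirical_risk(scores, comp_labels, loss_fn, constant_m=0.0):
    """Sample-mean estimate of Eq (8): (K - 1) / N * sum of losses + M."""
    k = len(scores[0])
    n = len(scores)
    total = sum(loss_fn(g, yb) for g, yb in zip(scores, comp_labels))
    return (k - 1) / n * total + constant_m

scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]   # hypothetical outputs g(x)
comp_labels = [2, 0]                           # hypothetical complementary labels
print(empirical_risk(scores, comp_labels, pc_loss))
print(empirical_risk(scores, comp_labels, ova_loss))
```

Because $M$ does not depend on the classifier, leaving it at zero changes the value of the estimate but not which classifier minimizes it.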

Although it is feasible to learn a classifier that minimizes Eq (8) from complementary labels, the restriction on the loss function limits the application of URE. Yu et al. ^[16] analyze the relationship between ordinary and complementary labels in terms of conditional probability:

$P\left(\bar{y} = \left.j\right|x\right) = \sum\limits _{i\ne j}P\left(\bar{y} = j|y = i\right)P\left(y = i|x\right),$

(9)

where $i, j\in \{1, 2, \dots, K\}$. When all complementary labels are selected in an unbiased way, $P\left(\bar{y}|y\right)$ can be expressed as a transition matrix $Q$:

$Q = {\left[\begin{array}{cccc}0& \frac{1}{K-1}& \cdots & \frac{1}{K-1}\\ \frac{1}{K-1}& 0& \cdots & \frac{1}{K-1}\\ \vdots& \vdots& \ddots & \vdots\\ \frac{1}{K-1}& \frac{1}{K-1}& \cdots & 0\end{array}\right]}_{K\times K},$

(10)

where the entry in row $i$ and column $j$ of $Q$ represents $P\left(\bar{y} = j|y = i\right)$. Since the true label and the complementary label of a sample are mutually exclusive, $P\left(\bar{y} = i|y = i\right) = 0$. Therefore, the entries on the diagonal of the matrix are 0.
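Eqs (9) and (10) can be checked numerically with a short sketch. The helper names and the example class posterior are hypothetical, and classes are 0-indexed here.

```python
def transition_matrix(k):
    """Uniform transition matrix Q of Eq (10): Q[i][j] = P(ybar = j | y = i)."""
    return [[0.0 if i == j else 1.0 / (k - 1) for j in range(k)] for i in range(k)]

def complementary_posterior(q, p):
    """Eq (9): P(ybar = j | x) = sum over i != j of Q[i][j] * P(y = i | x)."""
    k = len(p)
    return [sum(q[i][j] * p[i] for i in range(k) if i != j) for j in range(k)]

q = transition_matrix(4)
class_posterior = [0.7, 0.1, 0.1, 0.1]   # hypothetical P(y | x)
comp_posterior = complementary_posterior(q, class_posterior)
print(comp_posterior)
```

With the uniform $Q$, each entry of the result equals $(1 - P(y = j|x))/(K-1)$, so the class that is most likely to be the true label is the least likely to be drawn as a complementary label.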

Combining Eqs (5), (9) and (10), we can rewrite $\bar{R}(f)$ as:

$\bar{R}(f) = {\mathbb{E}}_{\bar{p}\left(x, \bar{y}\right)}\left[{L}_{CE}\left({Q}^{T}g\left(x\right), \bar{y}\right)\right],$

(11)

where ${L}_{CE}$ is the cross-entropy loss, which is widely used in deep learning. The classification risk $\bar{R}(f)$ in Eq (11) is also consistent with the ordinary classification risk $R(f)$ ^[16].

3. Complementary label learning based on knowledge distillation

3.1. Framework architecture

In image classification, the outputs from the last fully connected layer of a deep neural network, after the Softmax function, form the predicted probability distribution over all classes. Compared with a single logical label, these outputs carry more information. Hinton et al. ^[22] define the outputs as soft labels and propose a knowledge distillation framework. We draw on the idea of knowledge distillation and aim to improve the performance of CLL by enhancing complementary labels through soft labels.

In the framework of knowledge distillation, Hinton et al. ^[22] modify the Softmax function and they introduce the parameter $T$ to control the smoothness of soft labels. The ordinary Softmax function can be expressed as follows:

${y}_{i}^{\prime} = \frac{\exp\left({y}_{i}\right)}{\sum _{j}\exp\left({y}_{j}\right)},$

(12)

where ${y}_{i}^{\prime}$ is the predicted probability of the $i$-th class, $\exp(\cdot)$ is the exponential function and ${y}_{i}$ is the predicted output of the classification network for the $i$-th class. The Softmax function combines the prediction outputs of the model for all classes, and uses the exponential function to normalize the output values into the interval [0, 1].

The rewritten Softmax function is as follows:

${y}_{i}^{\prime} = \frac{\exp\left({y}_{i}/T\right)}{\sum _{j}\exp\left({y}_{j}/T\right)}.$

(13)

We present a comparison of the smoothness of soft labels for different $T$ in Figure 2. As $T$ gradually increases, the soft labels become smoother. In effect, $T$ regulates how much attention is paid to the negative labels: the higher $T$ is, the more attention they receive. $T$ is an adjustable hyperparameter during training.

Figure 2. The smoothness of soft labels for different $T$. The higher $T$, the smoother the soft labels will be.
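The smoothing effect of $T$ in Eq (13) can be reproduced with a small sketch. The logits below are made-up values, and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Eq (13): softmax of logits / T; T = 1 recovers the ordinary Softmax of Eq (12)."""
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.5, -1.0]   # hypothetical network outputs for one sample
for t in (1.0, 5.0, 80.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

For these logits the distribution is almost one-hot at $T = 1$ and close to uniform at $T = 80$, which matches the qualitative behavior described for Figure 2.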

For one sample, soft labels not only indicate its correct category, but also capture the correlation among the other labels. Soft labels therefore carry more information than the complementary label. If we add an extra term to the ordinary complementary label classification loss and introduce soft labels as additional supervision information, CLL can be expected to perform better than when using only complementary labels. Of course, we need a model with high accuracy to produce the soft labels, which makes them more credible. This model is also trained with complementary labels.

Taking advantage of this property, we propose KDCL, a complementary label learning framework based on knowledge distillation. The overall structure is shown in Figure 3.

Figure 3. The framework architecture of KDCL. $\alpha$ and $\beta$ are the weighting factors to balance KL loss and complementary loss.

KDCL is a two-stage training framework consisting of a more complex teacher model with higher accuracy and a simpler student model with lower accuracy. First, the teacher model is trained with complementary labels on the dataset and predicts all samples in the training set. The prediction results are normalized by the Softmax function with $T = t(t > 1)$ to generate soft labels ${S}_{tea}$ . Second, the student model is trained and its outputs are processed in two ways, one to produce the soft prediction results ${S}_{stu}$ with $T = t(t > 1)$ , and the other to output ordinary prediction results ${P}_{stu}$ with $T = 1$ . Then, the KL divergence between ${S}_{tea}$ and ${S}_{stu}$ is calculated, and the complementary label loss between ${P}_{stu}$ and the complementary labels is calculated at the same time. The two losses are weighted to obtain the final distillation loss. Finally, parameters of the student model will be updated by the final loss.

In KDCL, the final loss consists of the Kullback-Leibler (KL) loss and the complementary loss. On the one hand, the student model needs to learn knowledge from the teacher model to improve its ability. On the other hand, the teacher model is not completely correct, and the student model also needs to learn by itself to reduce the influence of the teacher model's errors on the learning process. It is better to consider both of them.

3.2. Loss function design

The final distillation loss consists of two parts and it can be expressed as follows:

${L}_{KDCL} = \alpha {L}_{KL}+{L}_{CL},$

(14)

where ${L}_{KL}$ denotes the KL divergence and ${L}_{CL}$ denotes the complementary loss. Given the probability distributions ${p}_{t}$ from the teacher model and ${p}_{s}$ from the student model, their KL divergence can be expressed as follows:

${L}_{KL}\left({p}_{t}, {p}_{s}\right) = \sum\limits _{i}-{p}_{ti}\log \frac{{p}_{si}}{{p}_{ti}},$

(15)

where $i$ denotes the $i$ -th element in tensor ${p}_{t}$ or ${p}_{s}$ .

We select three complementary losses for KDCL. They are the PC loss proposed by Ishida et al. ^[15], FWD loss proposed by Yu et al. ^[16] and SCL-NL loss proposed by Chou et al. ^[18]. Supposing that ${p}_{s}$ is the probability distribution for sample $x$ from the student model and $\bar{y}$ is the complementary label of $x$ , these complementary losses are shown in Eqs (16)–(18).

${\bar{L}}_{PC}\left({p}_{s}, \bar{y}\right) = \frac{K-1}{n}\sum\limits _{y\ne \bar{y}}\left({p}_{sy}-{{p}_{s}}_{\bar{y}}\right)-\frac{K\times \left(K-1\right)}{2}+K-1,$

(16)

${\bar{L}}_{FWD}\left({p}_{s}, \bar{y}\right) = -\sum\limits _{i}{\bar{y}}_{i}\times \log {\left({Q}^{T}{p}_{s}\right)}_{i},$

(17)

${\bar{L}}_{SCL-NL}\left({p}_{s}, \bar{y}\right) = \sum\limits _{i}{\bar{y}}_{i}\times \left(-\log \left(1-{p}_{s\bar{y}}\right)\right),$

(18)

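where $K$ denotes the number of categories in the dataset, ${\bar{y}}_{i}$ denotes the $i$-th element of the one-hot encoding of $\bar{y}$, and ${Q}^{T}$ denotes the transpose of $Q$, the $K\times K$ square matrix whose off-diagonal entries are all $1/(K-1)$ and whose diagonal entries are 0.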
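A minimal sketch of Eqs (17) and (18) for a single sample follows. It assumes that ${p}_{s}$ is a probability vector summing to 1, that $\bar{y}$ is one-hot encoded with 0-indexed classes, and that $Q$ is the uniform matrix of Eq (10); the function names and the example values are hypothetical.

```python
import math

def one_hot(label, k):
    """One-hot encoding of a 0-indexed label over k classes."""
    return [1.0 if i == label else 0.0 for i in range(k)]

def fwd_loss(p_s, comp_label):
    """Eq (17) with the uniform Q of Eq (10): -sum_i ybar_i * log((Q^T p_s)_i)."""
    k = len(p_s)
    ybar = one_hot(comp_label, k)
    # For the uniform Q, (Q^T p_s)_i = (1 - p_s[i]) / (K - 1) when p_s sums to 1.
    qtp = [(1.0 - p) / (k - 1) for p in p_s]
    return -sum(ybar[i] * math.log(qtp[i]) for i in range(k) if ybar[i] > 0.0)

def scl_nl_loss(p_s, comp_label):
    """Eq (18): the one-hot sum reduces to -log(1 - p_s[ybar])."""
    return -math.log(1.0 - p_s[comp_label])

p_s = [0.7, 0.2, 0.1]   # hypothetical student prediction
print(fwd_loss(p_s, 2), scl_nl_loss(p_s, 2))
```

Under these assumptions the two values differ only by the constant $\log(K-1)$, since $({Q}^{T}{p}_{s})_{\bar{y}} = (1-{p}_{s\bar{y}})/(K-1)$. Both losses decrease as the probability assigned to the complementary class decreases.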

With parameters ${p}_{t}$, ${p}_{s}$ and $\bar{y}$, the final loss can be expressed in more detail as follows:

${L}_{KD-PC}\left({p}_{t}, {p}_{s}, \bar{y}\right) = \alpha {L}_{KL}\left({p}_{t}, {p}_{s}\right)+{\bar{L}}_{PC}\left({p}_{s}, \bar{y}\right)$

(19)

${L}_{KD-FWD}\left({p}_{t}, {p}_{s}, \bar{y}\right) = \alpha {L}_{KL}\left({p}_{t}, {p}_{s}\right)+{\bar{L}}_{FWD}\left({p}_{s}, \bar{y}\right)$

(20)

${L}_{KD-SCL}\left({p}_{t}, {p}_{s}, \bar{y}\right) = \alpha {L}_{KL}\left({p}_{t}, {p}_{s}\right)+{\bar{L}}_{SCL-NL}\left({p}_{s}, \bar{y}\right)$

(21)

$\alpha$ is the weighting factor, which is used to control the degree of influence of soft labels on the overall classification loss. The values of $\alpha$ will be determined in the experiment.
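To make the combination concrete, the following sketch evaluates Eq (21) for one sample, using the two-way processing of the student output described in Section 3.1 (softened with $T$ for the KL term, $T = 1$ for the complementary term). The function names and logits are hypothetical, and the default values of `temperature` and `alpha` are only placeholders for the hyperparameters $T$ and $\alpha$.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / T, as in Eq (13)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_t, p_s):
    """Eq (15): sum_i -p_t[i] * log(p_s[i] / p_t[i])."""
    return sum(-pt * math.log(ps / pt) for pt, ps in zip(p_t, p_s) if pt > 0.0)

def kd_scl_loss(teacher_logits, student_logits, comp_label,
                temperature=80.0, alpha=0.5):
    """Eq (21) for one sample: alpha * KL(S_tea, S_stu) + SCL-NL(P_stu, ybar)."""
    s_tea = softmax(teacher_logits, temperature)   # softened teacher output
    s_stu = softmax(student_logits, temperature)   # softened student output
    p_stu = softmax(student_logits, 1.0)           # ordinary student prediction
    complementary = -math.log(1.0 - p_stu[comp_label])
    return alpha * kl_divergence(s_tea, s_stu) + complementary

teacher_logits = [6.0, 1.0, -2.0]   # hypothetical teacher outputs
student_logits = [2.0, 1.5, 0.5]    # hypothetical student outputs
print(kd_scl_loss(teacher_logits, student_logits, comp_label=2))
```

Swapping the last term for another complementary loss gives the analogues of Eqs (19) and (20). In a real training loop this scalar would be averaged over a batch and differentiated with respect to the student parameters, which the sketch does not attempt.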

4. Experiments

We evaluate and compare the student models optimized by KDCL with the same models only trained by complementary labels on four public image classification datasets. Three complementary label losses including PC loss ^[15], FWD loss ^[16] and SCL-NL loss ^[18], are used as loss functions for training the models. All the experiments are carried out on a server with a 15 vCPU Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz, 80 GB RAM and one RTX 3090 GPU with 24 GB memory.

4.1. Datasets

Four benchmark image classification datasets, including MNIST, Fashion-MNIST(F-MNIST), Kuzushiji-MNIST(K-MNIST) and CIFAR10, are used to verify the effectiveness of KDCL.

MNIST: consists of 60,000 28 × 28 pixel grayscale images for training and 10,000 images for testing, with a total of 10 categories representing numbers between 0 and 9.

F-MNIST: is an alternative dataset to MNIST and consists of 10 categories, 60,000 training images and 10,000 test images, each with a size of 28 × 28 pixels.

K-MNIST: is a dataset derived from 10 ancient Japanese characters widely used between the mid-Heian period and early modern Japan, and it is an extension of the MNIST dataset. K-MNIST contains a total of 70,000 gray-scale images of 28 × 28 pixels in 10 categories.

CIFAR10: consists of 60,000 32 × 32 color images, 50,000 of which are used as the training set and 10,000 as the test set. Each category contains 6000 images.

4.2. Experimental settings

Following the settings in ^[15,17,18], we select complementary labels in an unbiased way for the samples in all datasets. Besides, we apply two different sets of teacher-student networks to these datasets. Specifically, for MNIST, F-MNIST and K-MNIST, we choose Lenet-5 ^[23] as the teacher model and an MLP ^[24] with 500 hidden neurons as the student model, because these datasets are relatively simple and simple networks work well on them. For the CIFAR10 dataset, since color images are more difficult to classify, we need deeper CNNs to extract features. We choose DenseNet-121 ^[25] as the teacher model and ResNet-18 ^[26] as the student model.

For the training details, on MNIST, F-MNIST and K-MNIST we train Lenet-5 and the MLP for 120 epochs and use SGD as the optimizer with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate is 0.1 and it is halved every 30 epochs. The batch size is set to 128. On the CIFAR10 dataset, we train DenseNet-121 and ResNet-18 for 80 epochs and use SGD as the optimizer with a momentum of 0.9 and a weight decay of 0.0005. The learning rate is selected from {1e-1, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4} and it is divided by 10 every 30 epochs.
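The two step schedules described above can be written out as a small sketch. It assumes 0-indexed epochs with the decay applied at epochs 30, 60 and 90 (the exact boundary convention is not stated in the text), and the function name is hypothetical.

```python
def step_lr(epoch, initial_lr, step_size, factor):
    """Learning rate at a 0-indexed epoch when it is multiplied by `factor` every `step_size` epochs."""
    return initial_lr * factor ** (epoch // step_size)

# MNIST, F-MNIST and K-MNIST: 120 epochs, initial rate 0.1, halved every 30 epochs.
mnist_schedule = [step_lr(e, 0.1, 30, 0.5) for e in range(120)]

# CIFAR10: 80 epochs, rate divided by 10 every 30 epochs.
# 1e-2 is just one of the candidate rates listed in the text.
cifar_schedule = [step_lr(e, 1e-2, 30, 0.1) for e in range(80)]

print(mnist_schedule[0], mnist_schedule[30], mnist_schedule[119])
print(cifar_schedule[0], cifar_schedule[30], cifar_schedule[79])
```

Under this convention the MNIST-family runs end at one eighth of the initial rate, and the CIFAR10 runs end at one hundredth of whichever candidate rate is selected.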

4.3. Parameter sensitivity analysis

In Figure 4, we present a parameter sensitivity analysis of the distillation temperature $T$ in Eq (13) and the soft label weighting factor $\alpha$ in Eqs (19)–(21).

Figure 4. Test accuracy for different $T$ with fixed $\alpha$, and for different $\alpha$ with fixed $T$. The experiments are conducted with Lenet-5 and MLP on MNIST, F-MNIST and K-MNIST, and with DenseNet-121 and ResNet-18 on CIFAR-10.

We first explore the influence of the distillation temperature $T$. As we can see, when $T = 1$, which means directly using the probability distribution output by the teacher model as soft labels without softening, KDCL exhibits the worst accuracy. This is because when the temperature is low, there is a large gap between the positive and negative classes in the soft labels, making it difficult for the student model to learn effectively. As $T$ gradually increases, the soft labels become smoother, the student model can more easily learn the knowledge they contain, and the accuracy gradually improves. When $T\ge 80$, the gap between positive and negative classes in the soft labels becomes extremely small and the influence of the negative classes becomes too large, so the accuracy no longer increases and may even decrease.

Then, we further investigate the optimal value of soft label weighting factor $\alpha$ . We follow the setting in Hinton et al. ^[22], and set $\alpha$ in the range of 0 to 1. On the same dataset, the change of $\alpha$ does not have a great impact on the accuracy of KDCL. This indicates that the KDCL model parameter optimization process is not sensitive to the hyperparameter $\alpha$ . Nevertheless, the model still achieves higher accuracy when $\alpha = 0.5$ .

Based on the above analysis, we will set $T = 80$ , $\alpha = 0.5$ in subsequent experiments.

4.4. Experimental results

We show the accuracy for all models with three complementary label losses before and after being optimized by KDCL on four datasets. The results are presented in Table 1.

Table 1. Comparison of classification accuracies between different methods using different network architectures on MNIST, F-MNIST, K-MNIST and CIFAR-10.

Dataset	MNIST			F-MNIST			K-MNIST			CIFAR-10
Model	Lenet-5	MLP	KDCL-MLP	Lenet-5	MLP	KDCL-MLP	Lenet-5	MLP	KDCL-MLP	Lenet-5	MLP	KDCL-MLP
PC	89.94%	83.78%	86.10%	77.22%	76.67%	77.42%	67.77%	60.52%	60.34%	38.31%	32.74%	33.37%
FWD	85.35%	83.67%	84.61%	85.35%	83.67%	84.61%	86.85%	70.86%	75.41%	60.74%	44.93%	46.65%
SCL-NL	98.18%	92.06%	94.33%	85.93%	83.69%	84.66%	86.85%	70.59%	75.25%	61.64%	40.46%	45.98%

In Table 1, we show the experimental results of KDCL, where we compare the performance of the student model optimized by KDCL with that of the same model trained only with complementary labels across different losses and datasets. On MNIST, which is a relatively simple dataset, all methods achieve high accuracy. With the help of KDCL, we improve the accuracy of MLP from 83.78% to 86.10% with PC loss, 92.07% to 94.32% with FWD loss and 92.06% to 94.33% with SCL-NL loss. SCL-NL loss performs best among the three loss functions. Besides, after being enhanced by KDCL, the accuracy of KDCL-MLP falls between the accuracy of the MLP model and that of Lenet-5. On F-MNIST, which is more complex than MNIST, the accuracy of all methods decreases slightly. KDCL achieves 77.42% with PC loss, 84.61% with FWD loss and 84.66% with SCL-NL loss. On K-MNIST, which is more complex than F-MNIST, our method does not improve the accuracy of MLP when using PC loss, but it improves accuracy by 4.55 percentage points with FWD loss and 4.66 percentage points with SCL-NL loss. On CIFAR-10, which is the most complex of the four datasets, there is a significant drop in accuracy. Nevertheless, the student model can still be improved by KDCL, demonstrating its robustness and effectiveness across different datasets.

We show the testing process of all models in Figure 5.

Figure 5. Comparison of the testing process of teacher models, student models and KDCL-student models on four datasets.

In Figure 5, we present the convergence speed of all models in our experiments. The results show that the student model distilled by KDCL converges faster than that trained only with complementary labels. This indicates that the model can learn the features of the images more accurately and efficiently when utilizing both soft labels and complementary labels.

Additionally, we observe that the PC loss exhibits a decrease in accuracy on more challenging datasets, particularly on CIFAR10. This is because the PC loss uses the Sigmoid function as the normalization function, which can lead to negative values in the loss calculation and prevent the model from finding better parameters when updating. This phenomenon becomes more pronounced on the CIFAR10 dataset, where a peak appears. However, KDCL can alleviate this phenomenon and shift the peak to a later epoch. This demonstrates the effectiveness of KDCL in addressing the limitations of existing CLL methods and improving the performance of complementary label learning.

5. Discussion

In this study, we established a knowledge distillation training framework for CLL, called KDCL. As stated in the introduction, the supervision information in complementary labels is easily overlooked. The proposed framework employs a deep CNN model with higher accuracy to soften complementary labels into soft labels. Both the soft labels and the original complementary labels are used to train the classification model. After optimization by KDCL, accuracy improves by 0.5–4.5% compared with using the normal CLL methods alone.

The proposed framework has several limitations. First, KDCL's performance could be influenced by the choice of teacher-student models and CLL algorithms. Our experiments use specific combinations of models and algorithms, and the results may vary with different configurations. With better CNN architectures and stronger CLL algorithms, KDCL may achieve better performance on more difficult datasets. Another drawback of the proposed scheme is its time cost. Because KDCL is a two-stage training framework that first trains a high-accuracy teacher model using complementary labels, its overall training time is relatively high. Training a high-accuracy model typically takes a considerable amount of time, which poses a challenge to the efficiency of KDCL. In addition, KDCL has only been tested on public datasets whose data distribution is relatively uniform. In the future, we plan to expand the application scope of KDCL to dynamically imbalanced data for CLL, or to combine it with hybrid deep learning models ^[27,28,29].

6. Conclusions

In this paper, we make the first attempt to apply the knowledge distillation training framework to CLL. To enhance the supervised information in complementary labels, which is often overlooked by existing CLL methods, we propose a complementary label enhancement framework based on knowledge distillation, called KDCL. KDCL consists of a teacher model and a student model. Through knowledge distillation, the teacher model transfers its softened knowledge to the student model, which then learns from both the soft labels and the complementary labels to improve its classification performance. The experimental results on four benchmark datasets show that KDCL improves the classification accuracy of CLL and remains robust and effective on difficult datasets.

Use of AI tools declaration

The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61976217, 62306320), the Natural Science Foundation of Jiangsu Province (No. BK20231063), the Fundamental Research Funds of Central Universities (No. 2019XKQYMS87), and the Science and Technology Planning Project of Xuzhou (No. KC21193).

Conflict of interest

All authors declare that they have no conflicts of interest.

References

[1]	Y. Katsura, M. Uchida, Bridging ordinary-label learning and complementary-label learning, in Proceedings of the 12th Asian Conference on Machine Learning (ACML), 129 (2020), 161–176.
[2]	Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, L. J. Li, Learning from noisy labels with distillation, in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 97 (2017), 1928–1936. https://doi.org/10.1109/ICCV.2017.211
[3]	M. Hu, H. Han, S. Shan, X. Chen, Weakly Supervised image classification through noise regularization, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, (2019), 11509–11517. https://doi.org/10.1109/CVPR.2019.01178
[4]	K. H. Lee, X. He, L. Zhang, L. Yang, CleanNet: Transfer learning for scalable image classifier training with label noise, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, (2018), 5447–5456. https://doi.org/10.1109/CVPR.2018.00571
[5]	X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, et al., Are anchor points really indispensable in label-noise learning, in Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, (2019), 6838–6849.
[6]	X. Zhai, A. Oliver, A. Kolesnikov, L. Beyer, S4L: Self-supervised semi-supervised learning, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, (2019), 1476–1485. https://doi.org/10.1109/ICCV.2019.00156
[7]	D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, C. A. Raffel, MixMatch: a holistic approach to semi-supervised learning, in Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, (2019), 5049–5059.
[8]	T. Miyato, S. I. Maeda, M. Koyama, S. Ishii, Virtual adversarial training: A regularization method for supervised and semi-supervised learning, IEEE Trans. Pattern Anal. Mach. Intell., 41 (2019), 1979–1993. https://doi.org/10.1109/TPAMI.2018.2858821
[9]	T. Sakai, M. C. Plessis, G. Niu, M. Sugiyama, Semi-supervised classification based on classification from positive and unlabeled data, in Proceedings of the 34th International Conference on Machine Learning (ICML), (2017), 2998–3006.
[10]	Y. Yan, Y. Guo, Partial label learning with batch label correction, in Proceedings of the AAAI Conference on Artificial Intelligence, New York, 34 (2020), 6575–6582. https://doi.org/10.1609/aaai.v34i04.6132
[11]	N. Xu, J. Lv, X. Geng, Partial label learning via label enhancement, in Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, 33 (2019), 5557–5564. https://doi.org/10.1609/aaai.v33i01.33015557
[12]	M. L. Zhang, F. Yu, Solving the partial label learning problem: an instance-based approach, in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, (2015), 4048–4054.
[13]	T. Ishida, G. Niu, M. Sugiyama, Binary classification from positive-confidence data, in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montréal, (2018), 5921–5932.
[14]	N. Lu, G. Niu, A. K. Menon, M. Sugiyama, On the minimal supervision for training any binary classifier from only unlabeled data, preprint, arXiv: 1808.10585.
[15]	T. Ishida, G. Niu, W. Hu, M. Sugiyama, Learning from complementary labels, in Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, (2017), 5644–5654.
[16]	X. Yu, T. Liu, M. Gong, D. Tao, Learning with biased complementary labels, in Computer Vision—ECCV 2018, Springer, Cham, 11205 (2018), 68–83. https://doi.org/10.1007/978-3-030-01246-5_5
[17]	T. Ishida, G. Niu, A. Menon, M. Sugiyama, Complementary-label learning for arbitrary losses and models, in Proceedings of the 36th International Conference on Machine Learning (ICML), 97 (2019), 2971–2980.
[18]	Y. T. Chou, G. Niu, H. T. Lin, M. Sugiyama, Unbiased risk estimators can mislead: A case study of learning with complementary labels, in Proceedings of the 37th International Conference on Machine Learning (ICML), 119 (2020), 1929–1938.
[19]	D. Liu, J. Ning, J. Wu, G. Yang, Extending ordinary-label learning losses to complementary-label learning, IEEE Signal Process. Lett., 28 (2021), 852–856. https://doi.org/10.1109/LSP.2021.3073250
[20]	H. Ishiguro, T. Ishida, M. Sugiyama, Learning from noisy complementary labels with robust loss functions, IEICE Trans. Inf. Syst., 105 (2022), 364–376. https://doi.org/10.1587/transinf.2021EDP7035
[21]	Y. Zhang, F. Liu, Z. Fang, B. Yuan, G. Zhang, J. Lu, Learning from a complementary-label source domain: Theory and algorithms, IEEE Trans. Neural Networks Learn. Syst., 33 (2022), 7667–7681. https://doi.org/10.1109/TNNLS.2021.3086093
[22]	G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, preprint, arXiv: 1503.02531.
[23]	Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, 86 (1998), 2278–2324. https://doi.org/10.1109/5.726791
[24]	F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychol. Rev., 65 (1958), 386–408. https://doi.org/10.1037/h0042519
[25]	K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90
[26]	G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, (2017), 2261–2269. https://doi.org/10.1109/CVPR.2017.243
[27]	J. Jiang, F. Liu, W. W. Y. Ng, Q. Tang, W. Wang, Q. V. Pham, Dynamic incremental ensemble fuzzy classifier for data streams in green internet of things, IEEE Trans. Green Commun. Networking, 6 (2022), 1316–1329. https://doi.org/10.1109/TGCN.2022.3151716
[28]	L. Zhang, W. Chen, W. Wang, Z. Jin, C. Zhao, Z. Cai, et al., CBGRU: A detection method of smart contract vulnerability based on a hybrid model, Sensors, 22 (2022), 3577. https://doi.org/10.3390/s22093577
[29]	J. Jiang, F. Liu, Y. Liu, Q. Tang, B. Wang, G. Zhong, et al., A dynamic ensemble algorithm for anomaly detection in IoT imbalanced data streams, Comput. Commun., 194 (2022), 250–257. https://doi.org/10.1016/j.comcom.2022.07.034

© 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
