There are two main factors involved in document classification: the document representation method and the classification algorithm. In this study, we focus on the document representation method and demonstrate that the choice of representation method affects the quality of classification results. We propose a document representation strategy for supervised text classification, named document representation based on global policy (DRGP), which can obtain an appropriate document representation according to the distribution of terms. The main idea of DRGP is to construct an optimization function based on the importance of terms to different categories. In the experiments, we investigate the effects of DRGP on the 20 Newsgroups and Reuters21578 datasets, using SVM as the classifier. The results show that DRGP outperforms other text representation strategies, such as Document Max, Document Two Max and the global policy.
Citation: Longjia Jia, Bangzuo Zhang. A new document representation based on global policy for supervised term weighting schemes in text categorization[J]. Mathematical Biosciences and Engineering, 2022, 19(5): 5223-5240. doi: 10.3934/mbe.2022245
Text categorization (TC) is the task of automatically classifying unlabeled natural language documents into a predefined set of semantic categories. As the first and vital step, text representation converts the content of a textual document into a compact format so that the document can be recognized and classified by a classifier [1]. The vector space model (VSM) is the most widely used text representation model in text categorization. In VSM, a document is represented as a vector in term space, such as d = {t1, t2, …, tn}, where n is the number of features in the corpus and the value of ti represents how much term ti contributes to the semantics of document d. The terms in VSM are extracted from the training set; they can be words, phrases, n-grams, etc. [2].
Each document in a dataset is represented as a corresponding vector in the vector space, and the elements of each vector are assigned by term weighting methods. Most studies of term weighting methods for TC have shown that supervised term weighting methods are superior to unsupervised ones [3]. The traditional term weighting methods for TC are usually borrowed from information retrieval (IR) and belong to the unsupervised family. The simplest is binary representation; the most popular is tf*idf. Note that the tf factor has various variants, such as raw term frequency, log(tf), log(tf + 1), or log(tf) + 1. Likewise, the idf factor (usually computed as log(N/ni)) has a number of variants, such as log(N/ni + 1), log(N/ni) + 1, and log(N/ni) − 1 [3]. TC is a supervised learning task because the category labels of training documents are available in advance. Generally, this known information has been used in supervised term weighting methods in the following ways. One approach is to weight terms by adopting feature selection metrics such as chi-square, information gain, or gain ratio; the purpose of feature selection is to select the features with the most discriminative ability for the current classification problem, reduce the feature dimension of the data set, and improve classification efficiency and accuracy. Another approach is to weight terms in interaction with a text classifier. For example, in [4], terms are weighted using an iterative approach involving the k-Nearest Neighbor classifier at each step: in each iteration, the weights are adjusted according to the classification results on an evaluation set. However, this approach is usually time-consuming, especially on large-scale data.
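To make these weighting variants concrete, the short sketch below computes them from raw counts (N documents in the corpus, ni of them containing the term); it is an illustrative example written for this exposition, not code from the original study.

```python
import numpy as np

def idf_variants(N, n_i):
    """idf variants listed above for a term occurring in n_i of N documents."""
    return {
        "idf": np.log(N / n_i),
        "idf_plus_one_inside": np.log(N / n_i + 1),
        "idf_plus_one": np.log(N / n_i) + 1,
        "idf_minus_one": np.log(N / n_i) - 1,
    }

def tf_variants(tf):
    """tf variants listed above for a raw term frequency tf > 0."""
    return {
        "raw": tf,
        "log": np.log(tf),
        "log_plus_one_inside": np.log(tf + 1),
        "log_plus_one": np.log(tf) + 1,
    }

# Example: a term appearing 3 times in a document and in 100 of 10,000 documents.
print(tf_variants(3)["log_plus_one"] * idf_variants(10000, 100)["idf"])
```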
The difference between unsupervised and supervised term weighting methods is that supervised methods use the class information available in the training set. However, most existing studies do not discuss how to represent test documents under supervised term weighting [5], even though test documents carry no class information while supervised term weighting methods require it. A test document can first be represented as |C| different vectors, one per class cj, by using the estimated term distribution of each class, where |C| is the number of classes; it then has to be reduced to one vector that describes the document well in the vector space.
There are two major strategies: local policy and global policy [6,7,8]. Under local policy, each test document is represented by a separate vector for each independent binary classification task; that is, a document does not have a single vector but a collection of vectors, each tied to a specific binary classification task. Global policy has been widely used: each document has one global, task-independent representation. In most classification tasks, each document is assigned to one category and labeled with the most similar class label. For example, with a k-Nearest Neighbor classifier, the category of an unknown document is determined by finding its k nearest documents under Euclidean (or Mahalanobis or Manhattan) distance and then voting on their categories. Thus, most classification tasks are treated as single-label tasks and use global policy [9]. In addition, there are some models based on neural networks [10,11,12]. This study focuses on models based on global policy. Youngjoong Ko proposed Word Max (W-Max), Document Max (D-Max) and Document Two Max (D-TMax) to optimize the text representation and improve classification performance [9]; the idea behind W-Max is the same as the traditional global policy. However, none of these three methods always obtains optimal values across different data sets. These methods are discussed in Section 2.
Given the above, is there a relatively general strategy for supervised term weighting schemes, and, if so, which one achieves better performance? This is the first question we address in this study.
In this study, we investigate several well-known document representation strategies for supervised term weighting schemes, including the global policy, W-Max, D-Max, and D-TMax. Since we have not found similar work, this investigation is significant and valuable for document representation strategies for supervised term weighting schemes in automatic text categorization. At the same time, we propose a new document representation strategy for supervised term weighting schemes. These five document representation strategies are tested on two well-known document collections, Reuters-21578 (skewed category distribution) and 20 Newsgroups (uniform category distribution). We discuss the experimental results in detail and draw conclusions from different aspects.
The remainder of this paper is organized as follows. We briefly review several document representation strategies in Section 2. Section 3 introduces our proposed document representation strategy, as well as a short discussion of the method. We show experimental results in Section 4, and finally, we draw conclusions and discuss future work in Section 5.
In the scenario of text categorization, an indexing procedure that converts a raw document into a vector representation is usually necessary, since text documents cannot be directly interpreted by a classifier. Document representation is thereby one of the essential components in the construction of a classifier. There are several types of document representation methods: VSM, Latent Semantic Indexing (LSI) [13], Latent Dirichlet Allocation (LDA) [14], and representation methods based on word vector models, such as Word2Vec [15].
LSI uses matrix algebra techniques to approximate the source space with fewer dimensions. Probabilistic Latent Semantic Indexing (PLSI) has a more solid statistical foundation than LSI, since it is based on the likelihood principle and defines a proper generative model of the data [16]. Blei et al. later proposed a more widely used topic model, LDA, which recognizes the latent topics of documents and uses topic probability distributions as representations. The word2vec model and its applications by Mikolov et al. have attracted a great deal of attention. Its core idea is to obtain vectorial representations of words from their contexts, using one of two methods: CBOW and Skip-gram. The word vectors learned by word2vec have been shown to carry semantic meaning: words with similar semantics receive similar vector representations. In this section, we briefly review several document representation strategies based on VSM, the most classical method.
Under local policy, each test document is represented by a separate vector for each independent binary classification task.
Global policy is defined as follows.
$TW(t) = \max_{i=1}^{|C|} TW(t, c_i)$ | (1) |
where TW(t) is the final weight of a term t and TW(t, ci) is the weight of term t in category ci obtained with a supervised term weighting method. In the initial representation step, a test document can be represented as |C| different vectors; after applying an appropriate selection policy, it is reduced to one vector that describes the document well. Global policy selects, for each term, the maximum term weight over all categories. Although this method is effective in some cases, it cannot guarantee that the most effective term weighting vector is selected for the current test sample, because test documents carry no class information while supervised term weighting schemes require it. Besides local policy and global policy, Youngjoong Ko proposed three solutions to this problem, i.e., W-Max, D-Max and D-TMax [9]. Based on the idea of global policy, there are also aggregation functions such as the sum $f_{sum}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i)$, the weighted sum $f_{wsum}(t_k) = \sum_{i=1}^{|C|} P(c_i) f(t_k, c_i)$, and the maximum $f_{max}(t_k) = \max_{i=1}^{|C|} f(t_k, c_i)$. These functions try to capture the intuition that the best terms for one category are the ones distributed most differently in the sets of positive and negative examples of the category [17]. We now describe Youngjoong Ko's methods.
Each term's value in the term weighting vector is replaced by the maximum value of the corresponding dimension's term weight over all categories. Comparing this with global policy, we can see that they share the same idea.
The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector.
The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight.
To discuss the difference between the above strategies, we consider the following example. Assume there is a training set D = {d1, d2, …, dn} with m terms, where n is the number of documents and the documents belong to |C| categories. Corresponding to the category set C = {c1, c2, …, c|C|}, there is a term weighting vector set V = {v1, v2, …, v|C|}. The matrix M (consisting of v1 to v|C|) is defined as follows.
$M = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1j} & \cdots & t_{1m} \\ t_{21} & t_{22} & \cdots & t_{2j} & \cdots & t_{2m} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{i1} & t_{i2} & \cdots & t_{ij} & \cdots & t_{im} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{|C|1} & t_{|C|2} & \cdots & t_{|C|j} & \cdots & t_{|C|m} \end{bmatrix}$ | (2) |
In Eq (2), tij is the j-th element of the term weighting vector for category ci; it is calculated using a particular term weighting method (such as tf*rf, tf*or, or another scheme). Assume the term weighting vector of the test set is vd = {w1, w2, …, wm}, where wk is the k-th element of vd.
When W-Max is used, wk can be defined as follows.
$w_k = \max_{i=1}^{|C|} (w_{ik}), \quad k \in [1, m]$ | (3) |
In Eq (3), wik is the k-th element of the vector when category ci is used as a positive category. When D-Max is used, the sum of all term weights in each vector vi is first calculated.
$sum_i = \sum_{j=1}^{m} t_{ij}$ | (4) |
After sorting all $sum_i$, D-Max selects the maximum value $sum_{max}$. The term weighting vector corresponding to $sum_{max}$ is selected as vd.
When D-TMax is used, the sum of all term weights in each vector vi is first calculated using Eq (4). Then the two largest sum values are selected, labeled sumx and sumy, respectively, where the subscripts x and y are the indices of the corresponding categories. The k-th element of vd is calculated as in Eq (5).
$w_k = \max(t_{xk}, t_{yk}), \quad x \neq y \in [1, |C|], \; k \in [1, m]$ | (5) |
Compared with the W-Max (global policy) method, the other two methods achieved good performance in Youngjoong Ko's experiments [9]: D-TMax performed well on Reuters21578 (skewed category distribution), while D-Max performed well on 20 Newsgroups (uniform category distribution). From experiments on these two well-known document collections, we can conclude that the same method may have different effects on different data sets.
Our review of document representation shows that different data sets require different methods to achieve good performance. How can we know the effect of a document representation method before choosing it? In other words, when selecting a document representation strategy for an unknown data set, which method is the better choice? This is the second question we address in this study; we present the answer at the end of this paper. Table 1 records the numbers of documents that contain term tk and that do not contain term tk under category ci and its complement −ci. Usually, d ≫ a, b, c.
 | tk | −tk |
Positive category: ci | a | b |
Negative category: −ci | c | d |
An improper choice of document representation strategy leads to inappropriate term weights. A test document can first be represented as |C| different vectors by using the estimated term distribution of each category, but for some categories, the weights they assign to terms can negatively affect the role of those terms in classification. To illustrate this, suppose a skewed training set with 19 documents, 5 terms, and 5 categories. The relationship between terms, documents, and categories is shown in Table 2, where each number is the frequency of a term in a document.
category | document | t1 | t2 | t3 | t4 | t5 |
C1 | d1 | 0 | 0 | 2 | 19 | 3 |
C1 | d2 | 0 | 1 | 3 | 0 | 3 |
C1 | d3 | 0 | 0 | 5 | 16 | 2 |
C1 | d4 | 4 | 0 | 1 | 15 | 2 |
C1 | d5 | 0 | 0 | 2 | 18 | 3 |
C2 | d6 | 0 | 0 | 2 | 14 | 3 |
C2 | d7 | 0 | 1 | 3 | 0 | 3 |
C2 | d8 | 0 | 0 | 5 | 13 | 2 |
C2 | d9 | 4 | 0 | 1 | 11 | 2 |
C2 | d10 | 0 | 0 | 2 | 17 | 3 |
C3 | d11 | 0 | 0 | 3 | 0 | 3 |
C3 | d12 | 0 | 0 | 1 | 1 | 3 |
C3 | d13 | 0 | 1 | 1 | 0 | 2 |
C4 | d14 | 1 | 99 | 3 | 2 | 1 |
C4 | d15 | 2 | 99 | 1 | 1 | 1 |
C4 | d16 | 1 | 99 | 1 | 2 | 1 |
C5 | d17 | 4 | 0 | 3 | 0 | 3 |
C5 | d18 | 4 | 0 | 1 | 0 | 3 |
C5 | d19 | 4 | 1 | 1 | 0 | 2 |
As an example of a supervised term weighting scheme, here we use tf*rf [1]. Based on the notation in Table 1, it is defined as follows.
$tf{\ast}rf = tf \times \log\left(2 + \dfrac{a}{\max(1, c)}\right)$ | (6) |
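To illustrate how the rf factor in Eq (6) follows from the Table 1 counts, the helper below computes a and c for one term and one positive category from a document-term matrix; it is an illustrative sketch only (the full tf*rf weights in Table 3 additionally depend on the tf component and on normalization details not restated here).

```python
import numpy as np

def rf(tf_matrix, labels, term, category):
    """Relevance-frequency factor log(2 + a / max(1, c)) for one term and one category.
    a: positive-category documents containing the term; c: other documents containing it."""
    in_cat = (labels == category)
    has_term = (tf_matrix[:, term] > 0)
    a = np.sum(in_cat & has_term)
    c = np.sum(~in_cat & has_term)
    return np.log(2 + a / max(1, c))

# Toy data in the spirit of Table 2: rows are documents, columns are terms t1..t5.
tf_matrix = np.array([[0, 0, 2, 19, 3],
                      [0, 1, 3, 0, 3],
                      [1, 99, 3, 2, 1]])
labels = np.array(["C1", "C1", "C4"])
print(rf(tf_matrix, labels, term=1, category="C4"))  # rf of t2 for category C4
```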
The term weights of t1 to t5 for each category are shown in Table 3.
category | t1 | t2 | t3 | t4 | t5 |
C1 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C2 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C3 | 1.0000 | 1.1699 | 1.1375 | 1.0704 | 1.1375 |
C4 | 1.2630 | 1.4510 | 1.0875 | 1.1520 | 1.0875 |
C5 | 1.4594 | 1.0000 | 1.1375 | 1.0000 | 1.1375 |
In Youngjoong Ko's methods, D-TMax selects the two largest sum values. For multi-class classification problems, can more categories be selected to obtain better performance? We select from 1 to 5 categories to test this hypothesis. Choosing 1 or 2 categories corresponds to "D-Max" (Document Max) and "D-TMax" (Document Two Max) [9]. Following this rule, we name the 3-, 4- and 5-category variants "D-3Max" (Document Three Max), "D-4Max" (Document Four Max) and "D-5Max" (Document Five Max). For general multiclass text categorization, we call this family "D-NMax" (Document Number Max) in this study. The number of selected categories and the corresponding results are shown in Table 4, and a short computational sketch that reproduces these rows follows the table.
method | t1 | t2 | t3 | t4 | t5 |
D-Max | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-TMax | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-3Max | 1.2630 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-4Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-5Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
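For reference, the rows of Table 4 can be recomputed from the per-category weights in Table 3 with the sketch below, which implements Eqs (3)–(5) and the D-NMax generalization; the function names are our own and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def w_max(M):
    """W-Max / global policy: per-dimension maximum over all categories (Eq 3)."""
    return M.max(axis=0)

def d_max(M):
    """D-Max: keep the single category vector with the largest sum of weights (Eq 4)."""
    return M[np.argmax(M.sum(axis=1))]

def d_nmax(M, n):
    """D-NMax: element-wise maximum over the n vectors with the largest weight sums.
    n = 2 gives D-TMax (Eq 5); n = |C| reduces to W-Max."""
    top = np.argsort(M.sum(axis=1))[::-1][:n]
    return M[top].max(axis=0)

# Per-category term weights from Table 3 (rows C1..C5, columns t1..t5).
M = np.array([[1.1155, 1.1699, 1.2538, 1.3626, 1.2538],
              [1.1155, 1.1699, 1.2538, 1.3626, 1.2538],
              [1.0000, 1.1699, 1.1375, 1.0704, 1.1375],
              [1.2630, 1.4510, 1.0875, 1.1520, 1.0875],
              [1.4594, 1.0000, 1.1375, 1.0000, 1.1375]])
print(d_nmax(M, 3))  # matches the D-3Max row of Table 4
```

Varying n from 1 to 5 reproduces the D-Max through D-5Max rows of Table 4.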
According to the frequency of features in each category in Table 2, t3 and t5 are evenly distributed across categories and have no ability to distinguish categories. According to the distributions of t1, t2 and t4, these three terms are relatively discriminative, especially t2, which is highly concentrated in C4; its weight should be the highest among the five terms. Although t1 appears mainly in C5, it also appears in individual documents of categories C1, C2 and C4, so it has little ability to distinguish categories. For t4, the discriminative ability is not specific to a single category: it can only separate C1 and C2 from the other categories, but cannot tell whether a document belongs to C1 or C2. The results of D-Max, D-TMax and D-5Max in Table 4 violate our intuition that the weights of t1, t2 and t4 should be large, and that the weight of t1 should be relatively small compared to t2 and t4, since t1 appears with a low frequency in documents compared to t2 and t4. Another unreasonable observation is that t1 in some documents (d4 and d9) of categories C1 and C2 has the same frequency as t1 in the documents of C5. We can therefore see that the results of D-Max, D-TMax and D-5Max (W-Max) cannot boost the performance of text categorization, whereas the results of D-3Max are consistent with our intuition that the weights of t1, t2 and t4 should be large and the weight of t1 relatively small. To overcome the shortcomings of Youngjoong Ko's methods, in this section we explain our proposed DRGP method, which uses a traversal strategy to appropriately weigh the contribution of each term. DRGP can select the appropriate N (in D-NMax) to enhance the performance of text categorization. While our previous work in [18] provided experimental evidence for the effectiveness of DRGP under particular experimental circumstances, here we formally address how and why the new method is proposed using analytical explanation and empirical observation. Results in Section 4 show that the proposed strategy significantly outperforms the other methods.
In the DRGP method, we traverse the term weighting vectors generated for each candidate and compare their weighting effects on the training set; the term weighting vector that produces the best effect on the training set is selected as the term weighting vector for the test set. The idea of the proposed method is mainly inspired by Youngjoong Ko and by Miao and Kamel [19], who proposed a pairwise optimized Rocchio algorithm that dynamically adjusts the prototype position between pairs of categories on the training set and records the best prototype position for the test set. We summarize the main process of DRGP as Algorithm 1.
Algorithm 1: Document representation strategy based on global policy (DRGP)
Input: fea: feature matrix of the training set; gnd: vector of class labels of the training documents.
Output: selectedC: the most appropriate N value (in D-NMax) for the current dataset.
Local variables: |C|: total number of categories; M: total number of features; termWeightingVec1: the set of |C| original term weighting vectors; termWeightingVec1_i: i-th original term weighting vector; termWeightingVec2_i: i-th reconstructed term weighting vector; sumVec_i: sum of all term weights in the i-th term weighting vector; sortSum: sorted list of the sum values; weightedFea_i: fea weighted by termWeightingVec2_i; MicroF1_i: result of 10-fold cross validation on weightedFea_i.
begin
1: apply the supervised term weighting method to fea and obtain termWeightingVec1;
2: for i = 1 to |C|
3:   for j = 1 to M
4:     compute sumVec_i for termWeightingVec1_i;
5:   end for
6: end for
7: sort all sumVec_i and obtain sortSum;
8: for i = 1 to |C|
9:   for j = 1 to M
10:     for k = 1 to i
11:       tempC = sortSum[k];
12:       construct termWeightingVec2_i as follows: take the j-th dimension of each of the selected k term weighting vectors, i.e., termWeightingVec1[tempC][j];
13:       select the maximum of these termWeightingVec1[tempC][j] values as the j-th value of termWeightingVec2_i;
14:     end for
15:   end for
16: end for
17: for i = 1 to |C|
18:   compute weightedFea_i;
19: end for
20: for i = 1 to |C|
21:   compute MicroF1_i;
22: end for
23: record the i corresponding to the maximum MicroF1_i and assign it to selectedC;
end
After Algorithm 1 is executed, we obtain selectedC, that is, the N value in D-NMax. Then Eq (7) is used to calculate the term weighting vector of the test set. The sum of all term weights in each vector vi is first calculated using Eq (4). Then the N largest sum values are selected, labeled sum1, sum2, …, sumN, where the subscripts 1, 2, …, N are indices of different categories. The k-th element of vd is calculated as in Eq (7).
$w_k = \max(t_{1k}, t_{2k}, \ldots, t_{ik}, \ldots, t_{Nk}), \quad i \in [1, N], \; k \in [1, m]$ | (7) |
In Algorithm 1, steps 1 to 6 apply the term weighting method to the original data set to obtain the |C| original term weighting vectors and sum each of them to get sumVec_i. Steps 8 to 16 sort all sumVec_i to get sortSum and reconstruct |C| new term weighting vectors as follows: the first 1 to |C| vectors of sortSum are selected in turn, and one new term weighting vector is constructed each time, whose j-th element is the maximum of the j-th dimensions of the currently selected k term weighting vectors. Steps 17 to 19 weight the fea matrix with each newly constructed vector termWeightingVec2_i to obtain the corresponding weightedFea_i. Steps 20 to 22 evaluate each weightedFea_i by ten-fold cross-validation and record the corresponding MicroF1_i. Step 23 records the maximum MicroF1 and assigns the corresponding index i to selectedC. MicroF1 is a popular performance measure in text categorization; its formula is presented in Section 4.
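A compact Python sketch of the DRGP selection loop is given below. It follows Algorithm 1, but uses scikit-learn's LinearSVC and cross_val_score for the cross-validation step; the function name and the use of scikit-learn are our own illustrative choices, not part of the original description, which relies on LIBSVM.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def drgp_select_n(fea, gnd, term_weight_vectors, cv=10):
    """Select the N of D-NMax that maximizes MicroF1 on the training set (Algorithm 1).

    fea: (docs x terms) raw feature matrix of the training set
    gnd: class labels of the training documents
    term_weight_vectors: (|C| x terms) per-category supervised term weights
    """
    order = np.argsort(term_weight_vectors.sum(axis=1))[::-1]   # steps 2-7: sort by weight sums
    best_n, best_f1, best_vec = 1, -1.0, None
    for n in range(1, len(order) + 1):                          # steps 8-16: build D-NMax vectors
        vec = term_weight_vectors[order[:n]].max(axis=0)
        weighted_fea = fea * vec                                # steps 17-19: weight the features
        f1 = cross_val_score(LinearSVC(), weighted_fea, gnd,    # steps 20-22: 10-fold MicroF1
                             cv=cv, scoring="f1_micro").mean()
        if f1 > best_f1:
            best_n, best_f1, best_vec = n, f1, vec              # step 23: keep the best N
    return best_n, best_vec
```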
However, the algorithm has a shortcoming: high time complexity. The most time-consuming part is calculating MicroF1_i (steps 20 to 22), which is done with ten-fold cross-validation. According to Lin's 2006 machine learning summer school lecture, the time complexity of the support vector machine algorithm is $O(N^3)$, so the time complexity of this algorithm can be estimated as $O(|C| \cdot N^3)$. Compared with algorithms that trade time for space, the DRGP algorithm trades time for precision. The algorithm is thus better suited to classification systems that use a fixed training set: once the optimal model has been found on the training set, it can be applied for a long time without changing the parameters.
The 20 Newsgroups corpus is a commonly used benchmark dataset in TC [20]. The corpus contains 20,000 newsgroup documents nearly uniformly distributed over 20 classes. In this study, we use the 20 Newsgroups version sorted by date. After removing duplicates and headers, the remaining 18,846 documents are partitioned into 11,314 (about 60 percent) training documents and 7532 (about 40 percent) testing documents. After preprocessing, there are 26,214 distinct words in this data set.
The Reuters21578 corpus is used in many experiments and contains 21,578 documents in 135 categories [21,22]. We use its ModApte version, which has 5946 training documents and 2347 testing documents. In this study, we choose the 10 largest categories, which have 5228 training documents and 2057 testing documents. After preprocessing, the resulting vocabulary has 18,221 distinct words. Compared with 20 Newsgroups, it is a skewed data set: 80 percent of the categories have less than 7.5 percent of the instances.
The statistics of datasets are listed in Table 5.
Datasets | # of documents | distinct words | # of classes | # of training | # of testing |
20 Newsgroups | 18,846 | 26,214 | 20 | 11,314 | 7532 |
Reuters21578 | 7285 | 18,221 | 10 | 5228 | 2057 |
The document representation strategies compared in this study are listed in Table 6. Besides D-Max, D-TMax and W-Max, we also investigate D-NMax. In the experiments, we report results for the number of selected categories ranging from 1 to the maximum, so that the effect of changing the number of selected categories can be seen.
Denoted by | Description |
D-Max | The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector. |
D-TMax | The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight. |
D-NMax | The sum of all term weights in each term weighting vector is calculated and then N term weighting vectors with the N largest sum values are selected. Then the term weighting vector is constructed by choosing the highest term weighting value from the selected N term weighting vectors for each corresponding dimension's term weight. |
W-Max (global policy) | Each term's value in the term weighting vector is replaced by the maximum value of the corresponding dimension's term weight over all categories. |
To evaluate the classification performance of the proposed method, we choose the well-established Support Vector Machine (SVM) learning algorithm [23,24]. Although other algorithms such as k-Nearest Neighbor, Decision Tree and Naive Bayes are also widely used, they are not included because, apart from the binary representation, they cannot directly use real-valued term weights (see an exception in [25]). Finally, the SVM learning algorithm scales to large classification problems with thousands of features and examples.
The SVM is a relatively efficient machine learning algorithm with good performance. It is based on the structural risk minimization principle from computational learning theory [26] and can handle large-scale, high-dimensional data sets with high classification accuracy. According to the kernel function, SVMs are divided into nonlinear (e.g., polynomial, sigmoid, radial basis function) and linear methods. Leopold and Kindermann pointed out that term weighting schemes, rather than kernel functions, dominate the performance of SVM classifiers [27]. Moreover, the literature has reported that the linear SVM is superior to the nonlinear SVM for this task [28]. We therefore select the linear kernel function in this study; the other parameters of the SVM are set to their default values. The SVM software used is LIBSVM-3.14 [29].
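As a concrete illustration of this setup, the snippet below trains a linear-kernel SVM on made-up weighted feature vectors; we use scikit-learn's SVC wrapper around LIBSVM here purely as an example, while the experiments in this paper use the LIBSVM-3.14 package itself.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up weighted document vectors (rows: documents, columns: term weights) and labels.
X_train = np.array([[1.2, 0.0, 0.8],
                    [0.0, 1.4, 0.1],
                    [1.1, 0.2, 0.9]])
y_train = ["C1", "C2", "C1"]

clf = SVC(kernel="linear")        # linear kernel; other parameters left at their defaults
clf.fit(X_train, y_train)
print(clf.predict(np.array([[1.0, 0.1, 0.7]])))
```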
The standard measures of performance for a classification task are precision and recall [30,31]. However, it is well known that high precision may come with low recall, so evaluating precision and recall separately is ineffective [1]. The widely used F1 measure combines precision and recall. For a given category i, precision, recall and the F1 measure are defined as follows.
$Precision_i = TP_i / (TP_i + FP_i)$ | (8) |
$Recall_i = TP_i / (TP_i + FN_i)$ | (9) |
$F_1^i = (2 \times Precision_i \times Recall_i) / (Precision_i + Recall_i)$ | (10) |
where TPi is the number of documents correctly assigned to class i, FPi is the number of documents that do not belong to class i but are assigned to it, and FNi is the number of documents that belong to class i but are not assigned to it.
The F1 measure is estimated by MicroF1 and MacroF1 [32]. In this study, MicroF1 and MacroF1 are employed to measure the performance of the proposed method. They are computed as in Eqs (11) and (12).
$MicroF_1 = (2 \times Precision \times Recall) / (Precision + Recall)$ | (11) |
$MacroF_1 = \dfrac{1}{m} \sum_{i=1}^{m} F_1^i$ | (12) |
In Eq (12), m is the number of categories. In Eq (11), Precision and Recall are computed globally over all document/category pairs, so MicroF1 assigns equal weight to each document. MacroF1 assigns equal weight to each category and is therefore influenced by the results on rare categories.
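For reference, both measures can be computed directly with scikit-learn, as in the minimal example below with made-up labels.

```python
from sklearn.metrics import f1_score

y_true = ["C1", "C1", "C2", "C2", "C3", "C3"]
y_pred = ["C1", "C2", "C2", "C2", "C3", "C1"]

micro = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over all classes (Eq 11)
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1 (Eq 12)
print(micro, macro)
```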
To verify the effectiveness of the proposed method, besides the tf*rf scheme mentioned above, we also include tf*or, which has been widely used in previous studies [33]. Its formula is
$tf{\ast}or = tf \times \log\left(\dfrac{a \times d}{b \times c}\right)$ | (13) |
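Analogously to the rf sketch above, the odds-ratio factor of Eq (13) can be computed from the four contingency counts of Table 1; the small helper below is an illustrative reading of the formula (in practice a smoothing constant would be needed when b or c is zero, a detail the paper does not discuss), and the example counts are made up.

```python
import numpy as np

def odds_ratio(a, b, c, d):
    """log(a*d / (b*c)) from the Table 1 contingency counts (Eq 13)."""
    return np.log((a * d) / (b * c))

print(odds_ratio(a=4, b=1, c=7, d=180))  # example counts, chosen for illustration
```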
The main purpose of the experiments is to address the two questions raised above: to explore the relative merits of document representation strategies for supervised term weighting schemes, and to find a way of choosing a document representation strategy in advance. To this end, we compare the methods on two popular benchmark corpora, 20 Newsgroups and Reuters-21578, using SVM in terms of the MicroF1 and MacroF1 measures. In addition to the results of DRGP, the results of selecting other numbers of categories are also listed. The experimental results and discussion on the two corpora are given in Sections 4.5.1–4.5.3.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups with the tf*or term weighting method, and Eq (7) is then used to calculate the term weighting vector of the test set. Figure 1 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7884) and MacroF1 (0.7837) are achieved by DRGP and W-Max. D-Max obtains 0.5539 (MicroF1) and 0.5521 (MacroF1), and D-TMax obtains 0.6036 (MicroF1) and 0.6027 (MacroF1). The results and trends of MicroF1 and MacroF1 on 20 Newsgroups are basically the same, mainly because 20 Newsgroups is a balanced dataset.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups with the tf*rf term weighting method, and Eq (7) is then used to calculate the term weighting vector of the test set. Figure 2 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7958) and MacroF1 (0.7909) are achieved by DRGP and W-Max. D-Max obtains 0.7562 (MicroF1) and 0.7502 (MacroF1), and D-TMax obtains 0.7592 (MicroF1) and 0.7529 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 with the tf*or term weighting method, and Eq (7) is then used to calculate the term weighting vector of the test set; in other words, the most appropriate N value for this dataset is 6. The sum of all term weights in each term weighting vector is calculated, the six term weighting vectors with the six largest sums are selected, and the final vector takes, for each dimension, the highest value among the selected six vectors. Figure 3 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9203) and MacroF1 (0.8919) are achieved by DRGP. D-Max obtains 0.8769 (MicroF1) and 0.8284 (MacroF1), D-TMax obtains 0.8697 (MicroF1) and 0.825 (MacroF1), and W-Max obtains 0.8898 (MicroF1) and 0.882 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 with the tf*rf term weighting method, and Eq (7) is then used to calculate the term weighting vector of the test set. Figure 4 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9283) and MacroF1 (0.9086) are achieved by DRGP. D-Max obtains 0.9258 (MicroF1) and 0.9077 (MacroF1), D-TMax obtains 0.9258 (MicroF1) and 0.8966 (MacroF1), and W-Max obtains 0.9269 (MicroF1) and 0.9049 (MacroF1).
The above results show that DRGP and W-Max achieved the best performance on 20 Newsgroups, while DRGP alone achieved the best performance on Reuters-21578. In addition, the results and trends of MicroF1 and MacroF1 on 20 Newsgroups are basically consistent. We consider these results natural, because 20 Newsgroups is a balanced dataset and Reuters-21578 is a skewed one. Many documents in Reuters-21578 have two or more labels; for example, if a document with two labels is represented using the class distribution of only one of them, a classifier could have difficulty assigning it to the other class.
On skewed data sets, the document representation strategies of D-Max, D-TMax, and W-Max (global policy) are not suitable. Note that although our experiments use the same corpora and the same evaluation measures as Youngjoong Ko's [9], there are minor differences in data preparation, such as the stemming and stop-word lists used for text preprocessing and the feature selection measures. On Reuters-21578, the experimental results show the same phenomenon: the results obtained by D-TMax are greater than those of D-Max, and the results obtained by W-Max are the worst.
From the experimental results, we can observe that on the balanced data set (20 Newsgroups), MicroF1 and MacroF1 increase as the number of selected categories increases, whereas on the skewed data set (Reuters-21578), MicroF1 and MacroF1 first increase and then decrease. Therefore, to obtain better results, it is necessary to select the appropriate method according to the characteristics of the data set. The proposed DRGP method achieved the best performance on both the balanced and the skewed data sets; on the skewed dataset in particular, its results are greatly improved over those of the traditional methods. Note that here we only use the tf*or and tf*rf weighting methods and the SVM classifier; our method can be extended to other supervised term weighting methods, and other classifiers can be used for cross-validation or classification.
Document representation is one of the most important parts of constructing a text classifier. For supervised term weighting methods, besides the global policy and the local policy, there are strategies such as D-Max, D-TMax and W-Max. Faced with so many strategies, it is not obvious how to choose one to achieve the best results; that is, the N value in D-NMax cannot be determined in advance.
In this study, we investigated and addressed this problem. We found that the representation method should be chosen according to the characteristics of the corpus to obtain better classification performance. Compared with other methods, such as D-Max, D-TMax and W-Max, the method proposed in this study is not only a weighting method but also a selection strategy: it selects the appropriate N (in D-NMax) according to the distribution of each category in the dataset. Testing the proposed method with two representative supervised term weighting methods (tf*rf and tf*or) on two datasets (20 Newsgroups and Reuters-21578) shows that the method is effective on both balanced and unbalanced datasets.
Based on the original document representation methods, the proposed method introduces the idea of traversal: through cross-validation, it first finds the optimal document representation model on the training set and then applies the selected model to the test set. Because different classifiers have different working mechanisms, the classifier may influence the final results, but the proposed method has no special requirements on the classifier; in this study, we use the SVM. The method can be extended to other supervised feature weighting methods, and other classifiers can be used for cross-validation or classification. Compared with the original document representation methods, the proposed method incurs additional computational cost (the cross-validation on the training set). In future research, we will study new optimization methods to reduce this cost.
At the end of this study, we answer the questions we raised in the beginning of the paper as conclusions:
1) Is there a relatively general strategy for supervised term weighting schemes, and, if yes, which one can achieve better performance?
Through a series of evaluations in text categorization, on both a balanced dataset and a skewed dataset, we find that W-Max, D-Max and D-TMax perform differently on different types of datasets. The proposed DRGP and W-Max achieved the best performance on 20 Newsgroups, and the proposed DRGP achieved the best performance on Reuters-21578.
2) How can we know the effect of document representation method before we choose it? In other words, when selecting a document representation strategy for an unknown data set, which method is the best choice?
In this study, we propose the DRGP method, which can select an appropriate representation strategy for an unknown dataset. Whether or not the data set is uniform, DRGP uses a traversal over the constructed optimization function to find the best document representation strategy on the training set and then applies it to the test set. DRGP exhibits stable and consistent improvement over most of the previous document representations considered in the experiments.
This work was supported by the National Natural Science Foundation of China under Grant 61977015, and the Jilin Province Development and Reform Commission Project of China under Grant 2020C017-3. We would like to thank these organizations for their support.
All authors declare no conflicts of interest in this paper.
[1] | M. Lan, S. Sung, H. Low, C. Tan, A comparative study on term weighting schemes for text categorization, in Proceedings 2005 IEEE International Joint Conference on Neural Networks, 1 (2005), 546–551. https://doi.org/10.1109/IJCNN.2005.1555890 |
[2] | X. Li, A. Zhang, C. Li, J. Ouyang, Y. Cai, Exploring coherent topics by topic modeling with term weighting, Inf. Process. Manage., 54 (2018), 1345–1358. https://doi.org/10.1016/j.ipm.2018.05.009 |
[3] | M. Lan, C. Tan, J. Su, Y. Lu, Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell., 31 (2008), 721–735. https://doi.org/10.1109/TPAMI.2008.110 |
[4] | E. H. Han, G. Karypis, V. Kumar, Text categorization using weight adjusted k-nearest neighbor classification, Proc. Pacific Asia Conf. Knowl. Discovery Data Min., (2001), 53–65. https://doi.org/10.1007/3-540-45357-1_9 |
[5] | X. Quan, W. Liu, B. Qiu, Term weighting schemes for question categorization, IEEE Trans. Pattern Anal. Mach. Intell., 33 (2010), 1009–1021. https://doi.org/10.1109/TPAMI.2010.154 |
[6] | A. I. Kadhim, Survey on supervised machine learning techniques for automatic text classification, Artif. Intell. Rev., 51 (2019), 273–292. https://doi.org/10.1007/s10462-018-09677-1 |
[7] | M. M. Michał, J. Protasiewicz, A recent overview of the state-of-the-art elements of text classification, Expert Syst. Appl., 106 (2018), 36–54. https://doi.org/10.1016/j.eswa.2018.03.058 |
[8] | C. Liu, Y. Sheng, Z. Wei, Y. Yang, Research of text classification based on improved TF-IDF algorithm, in 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE), 2018. https://doi.org/10.1109/IRCE.2018.8492945 |
[9] | Y. Ko, A study of term weighting schemes using class information for text classification, in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012. https://doi.org/10.1145/2348283.2348453 |
[10] | M. Yurochkin, S. Claici, E. Chien, F. Mirzazadeh, J. Solomon, Hierarchical optimal transport for document representation, preprint, arXiv: abs/1906.10827. |
[11] | W. Zhang, Y. Li, S. Wang, Learning document representation via topic-enhanced LSTM model, Knowl. Based Syst., 174 (2019), 194–204. https://doi.org/10.1016/J.KNOSYS.2019.03.007 |
[12] | L. Li, B. Qin, W. Ren, T. Liu, Document representation and feature combination for deceptive spam review detection, Neurocomputing, 254 (2017), 33–41. https://doi.org/10.1016/j.neucom.2016.10.080 |
[13] | S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci., 41 (1990), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 |
[14] | D. M. Blei, A. Ng, M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res., 3 (2003), 993–1022. https://doi.org/10.1016/B978-0-12-411519-4.00006-9 |
[15] | T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, Comput. Sci., 2013. https://doi.org/10.48550/arXiv.1301.3781 |
[16] | Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, Int. Conf. Mach. Learn. PMLR, 2014. |
[17] | F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. (CSUR), 34 (2002), 1–47. https://doi.org/10.1145/505282.505283 |
[18] | L. Jia, B. Zhang, Optimal document representation strategy for supervised term weighting schemes in automatic text categorization, in 2019 9th International Conference on Information and Social Science, 2019. |
[19] | Y. Q. Miao, M. Kamel, Pairwise optimized Rocchio algorithm for text categorization, Pattern Recogn. Lett., 32 (2011), 375–382. https://doi.org/10.1016/j.patrec.2010.09.018 |
[20] | C. Deng, X. He, Manifold adaptive experimental design for text categorization, IEEE Trans. Knowl. Data Eng., 24 (2011), 707–719. https://doi.org/10.1109/TKDE.2011.104 |
[21] | L. Man, C. L. Tan, H. B. Low, Proposing a new term weighting scheme for text categorization, AAAI, 6 (2006). |
[22] | M. Revanasiddappa, B. Harish, A new feature selection method based on intuitionistic fuzzy entropy to categorize text documents, Int. J. Interact. Multim. Artif. Intell., 5 (2018), 106–117. https://doi.org/10.9781/ijimai.2018.04.002 |
[23] | M. Goudjil, M. Koudil, M. Bedda, N. Ghoggali, A novel active learning method using SVM for text classification, Int. J. Autom. Comput., 15 (2018), 290–298. https://doi.org/10.1007/S11633-015-0912-Z |
[24] | M. Haddoud, A. Mokhtari, T. Lecroq, S. Abdeddaïm, Combining supervised term-weighting metrics for SVM text classification with extended term representation, Knowl. Inf. Syst., 49 (2016), 909–931. https://doi.org/10.1007/s10115-016-0924-1 |
[25] | A. McCallum, K. Nigam, A comparison of event models for naive Bayes text classification, in Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998. |
[26] | Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr., 1 (2004), 69–90. https://doi.org/10.1023/A:1009982220290 |
[27] | E. Leopold, J. Kindermann, Text categorization with support vector machines. How to represent texts in input space, Mach. Learn., 46 (2002), 423–444. https://doi.org/10.1023/A:1012491419635 |
[28] | S. Lee, K. Seo, Intelligent fault diagnosis based on a hybrid multi-class support vector machines and case-based reasoning approach, J. Comput. Theor. Nanosci., 10 (2013), 1727–1734. https://doi.org/10.1166/JCTN.2013.3116 |
[29] | C. C. Chang, C. J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST), 2 (2011), 27:1–27:27. https://doi.org/10.1145/1961189.1961199 |
[30] | J. Zhang, L. Chen, G. Guo, Projected-prototype based classifier for text categorization, Knowl.-Based Syst., 49 (2013), 179–189. https://doi.org/10.1016/j.knosys.2013.05.013 |
[31] | F. Ren, M. G. Sohrab, Class-indexing-based term weighting for automatic text classification, Inf. Sci., 236 (2013), 109–125. https://doi.org/10.1016/j.ins.2013.02.029 |
[32] | I. Alsmadi, G. K. Hoon, Term weighting scheme for short-text classification: Twitter corpuses, Neural Comput. Appl., 31 (2019), 3819–3831. https://doi.org/10.1007/s00521-017-3298-8 |
[33] | Y. Ko, New feature weighting approaches for speech-act classification, Pattern Recogn. Lett., 51 (2015), 107–111. https://doi.org/10.1016/j.patrec.2014.08.014 |