Multiple-instance learning for text categorization based on semantic representation

Jian-Bing Zhang; Yi-Xin Sun; De-Chuan Zhan

doi:10.3934/bdia.2017009

Big Data and Information Analytics

2017, Volume 2, Issue 1: 69-75. doi: 10.3934/bdia.2017009



National Key Laboratory for Novel Software Technology, Nanjing University, China

Published: 01 January 2017
97R40

Text categorization is a fundamental building block for related research in NLP. Researchers have proposed many effective text categorization methods that achieve good performance. However, these methods are generally based on raw or low-level features, e.g., tf or tfidf, and neglect the semantic structure between words. Complex semantic information can influence the precision of text categorization. In this paper, we propose a new method that handles the semantic correlations between words and text features through both the representation and the learning scheme. We represent each document as multiple instances based on word2vec. Experiments validate the effectiveness of the proposed method compared with state-of-the-art text categorization methods.
Keywords: text categorization, text representation, multiple-instance learning, mi-SVM, word2vec
Citation: Jian-Bing Zhang, Yi-Xin Sun, De-Chuan Zhan. 2017: Multiple-instance learning for text categorization based on semantic representation, Big Data and Information Analytics, 2(1): 69-75. doi: 10.3934/bdia.2017009
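
The abstract's idea of representing a document as multiple instances can be illustrated with a small sketch. The abstract does not say how instances are formed, so everything below is an assumption made for illustration: each sentence becomes one instance, an instance is the average of its word vectors, and the hypothetical `TOY_EMBEDDINGS` table stands in for vectors from a trained word2vec model. A multiple-instance learner such as mi-SVM (listed among the keywords) would then take such bags as input; that step is not shown.

```python
from typing import Dict, List, Optional

# Hypothetical toy embeddings standing in for trained word2vec vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "stocks": [1.0, 0.0],
    "fell": [0.8, 0.2],
    "team": [0.0, 1.0],
    "won": [0.2, 0.8],
}


def sentence_to_instance(
    sentence: str, embeddings: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of the known words; return None if no word is known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_to_bag(
    document: str, embeddings: Dict[str, List[float]]
) -> List[List[float]]:
    """Assumed scheme: split on '.', map each sentence to one instance."""
    bag = []
    for sentence in document.split("."):
        instance = sentence_to_instance(sentence, embeddings)
        if instance is not None:  # skip sentences with no known words
            bag.append(instance)
    return bag


# One document becomes a bag holding one vector per sentence.
bag = document_to_bag("Stocks fell. The team won.", TOY_EMBEDDINGS)
```

Under this assumed scheme a document's label attaches to the whole bag rather than to any single sentence vector, which is the setting multiple-instance learning addresses.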




© 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)