A category-based probabilistic approach to feature selection

Jianguo Dai; Wenxue Huang; Yuanyi Pan

doi:10.3934/bdia.2017020

Big Data and Information Analytics

2018, Volume 3, Issue 1: 14-21. doi: 10.3934/bdia.2017020

Jianguo Dai ^{1}, Wenxue Huang ^{1}, Yuanyi Pan ^{2}

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Published: 01 November 2017
Mathematics Subject Classification: Primary: 62H20, 62F07; Secondary: 68T30, 58F17

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.
Keywords: association, categorical data, feature selection, statistical reliability
Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. 2018: A category-based probabilistic approach to feature selection, Big Data and Information Analytics, 3(1): 14-21. doi: 10.3934/bdia.2017020

© 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
