A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association with the response variable but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.
Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. 2018: A category-based probabilistic approach to feature selection, Big Data and Information Analytics, 3(1): 14-21. doi: 10.3934/bdia.2017020
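To make the idea of a noninformative category concrete, the following is a minimal sketch and not the authors' procedure. The abstract does not name an association measure, so the sketch assumes the Goodman-Kruskal tau and splits it into one term per category of a single explanatory variable; the function name `gk_tau_contributions` and the toy data are hypothetical. Under this assumed criterion, a category whose term is close to zero tells us little about the response.

```python
from collections import Counter, defaultdict


def gk_tau_contributions(xs, ys):
    """Return (tau, per-category terms) for predicting ys from xs.

    Assumption: association is measured by the Goodman-Kruskal tau,
    tau = (sum_x p(x) * sum_y p(y|x)^2 - sum_y p(y)^2) / (1 - sum_y p(y)^2).
    """
    n = len(xs)
    # Sum of squared marginal probabilities of the response.
    base = sum((c / n) ** 2 for c in Counter(ys).values())
    if base == 1.0:
        raise ValueError("response is constant; tau is undefined")

    # Response counts within each category of the explanatory variable.
    by_x = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_x[x][y] += 1

    contrib = {}
    for x, counts in by_x.items():
        n_x = sum(counts.values())
        # Sum of squared conditional probabilities p(y | x).
        cond = sum((c / n_x) ** 2 for c in counts.values())
        # This category's share of tau; the shares add up to tau.
        contrib[x] = (n_x / n) * (cond - base) / (1.0 - base)
    return sum(contrib.values()), contrib


# Hypothetical toy data: "a" and "b" determine the response, "c" does not.
xs = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
tau, contrib = gk_tau_contributions(xs, ys)
for category in sorted(contrib):
    print(category, round(contrib[category], 4))
print("tau", round(tau, 4))
```

In this toy data the response within category "c" is distributed exactly like the overall response, so its term is zero, while "a" and "b" account for all of the association. Individual terms can be negative in general; only their sum, tau, is guaranteed to be nonnegative.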