A category-based probabilistic approach to feature selection

  • Published: 01 November 2017
  • Primary: 62H20, 62F07; Secondary: 68T30, 58F17

  • A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

    Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. 2018: A category-based probabilistic approach to feature selection, Big Data and Information Analytics, 3(1): 14-21. doi: 10.3934/bdia.2017020

    [1] Daly A., Dekker T., Hess S. (2016) Dummy coding vs effects coding for categorical variables: Clarifications and extensions. J. Choice Modelling 21: 36-41. doi: 10.1016/j.jocm.2016.09.005
    [2] S. Garavaglia and A. Sharma, A Smart Guide to Dummy Variables: Four Applications and a Macro, 1998.
    [3] Gokhale S. S. (2004) Quantifying the variance in application reliability. IEEE Pacific Rim International Symposium on Dependable Computing 113-121. doi: 10.1109/PRDC.2004.1276562
    [4] L. A. Goodman and W. H. Kruskal, Measures of Association for Cross Classifications, Springer-Verlag, New York-Berlin, 1979. MR553108
    [5] Guttman L. (1946) The test-retest reliability of qualitative data. Psychometrika 11: 81-95. doi: 10.1007/BF02288925
    [6] Huang W., Li X., Pan Y. (2016) Increase statistical reliability without lossing predictive power by merging classes and adding variables. Big Data and Information Analytics 1: 341-347.
    [7] Huang W., Pan Y. (2016) On balancing between optimal and proportional categorical predictions. Big Data and Information Analytics 1: 129-137. doi: 10.3934/bdia.2016.1.129
    [8] Huang W., Shi Y., Wang X. (2017) A nominal association matrix with feature selection for categorical data. Communications in Statistics -Theory and Methods 46: 7798-7819. doi: 10.1080/03610926.2014.930911
    [9] S. Kamenshchikov, Variance Ratio as a Measure of Backtest Reliability, Futures, 2015.
    [10] J. Li, K. Cheng, et al., Feature selection: A data perspective, arXiv: 1601.07996v4, [cs.LG], ACM Computing Surveys 50 (2018), Article No. 94. doi: 10.1145/3136625
    [11] C. J. Lloyd, Statistical Analysis of Categorical Data, John Wiley & Sons, Inc., New York, 1999. MR1682513
    [12] STATCAN, 1998. Survey of Family Expenditures-1996.
    [13] UCI Machine Learning Repository, Mushroom Data Set. http://archive.ics.uci.edu/ml/datasets/Mushroom
  • © 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)