Citation: Wenxue Huang, Xiaofeng Li, Yuanyi Pan. Increase statistical reliability without losing predictive power by merging classes and adding variables[J]. Big Data and Information Analytics, 2016, 1(4): 341-348. doi: 10.3934/bdia.2016014
In any application where feature selection or dimension reduction is required, a key question is how many variables or features are enough. More variables may increase the data-based association degree, but may also reduce the reliability of the explanatory information or overfit the model. It is particularly important for a stepwise forward feature selection procedure [8] to decide when to stop aggregating variables. The procedure can be stopped when the maximum joint association or a predefined maximum number of variables is reached. More discussion of this subject can be found in [2].
Prediction accuracy naturally attracts most of the attention and has been studied for hundreds of years. Categorical data analysis alone has the rate of point-hit accuracy, the rate of distribution bias, and a balanced rate between the two [9]. Huang, Shi and Wang [12] suggested that the measure of association is fundamental to obtaining the prediction accuracy rate, and that this measure increases as more explanatory variables are added to their probabilistic model [12].
The risk of model failure, or the model's reliability, is usually related to the average number of categories in a categorical predictive model. Guttman [7] presented methods to estimate the upper and lower bounds of a categorical data set's reliability. These estimates are functions of the number of categories available and the proportion of instances from which the model response is chosen. Probably the most generally applicable and widely used method for estimating the reliability of ratings or judgments is the intra-class correlation, or some variation of it [3]. However, none of these methods reflect the response variable's distribution.
We hence introduce a new measure, denoted as E(Gini(X|Y)) and defined in the next section, to assess the reliability of the explanatory information.
We also prove that the association between the merged independent variable and the target variable remains exactly the same after the merge if the merged independent classes have the same conditional probabilities. Thus, we believe that the solution to the dilemma of increasing association and decreasing reliability along the feature selection process is to merge categories with similar conditional probabilities before adding new variables.
This article is organized as follows. Section 2 presents the definitions of the association and reliability measures; section 3 discusses how and why independent classes are merged; two supportive experiments are analyzed in section 4; the last section is a brief summary and a discussion of future work.
Given a nominal categorical data set with one independent variable X and one dependent variable Y, the Goodman-Kruskal λ [6] is defined as
λ=∑xρxm−ρ⋅m1−ρ⋅m, |
where
ρ⋅m=maxyρ⋅y=maxyp(Y=y), ρxm=maxyρxy=maxyp(X=x;Y=y). |
Please note that λ is based on the optimal, point-hit prediction. The proportional prediction based Goodman-Kruskal τ is defined as
τ=∑x∑yρxy2/ρx⋅−∑yρ⋅y21−∑yρ⋅y2, |
where
ρx⋅=p(X=x). |
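As a concrete illustration, both measures can be computed directly from a joint probability table. Below is a minimal NumPy sketch; the function names and the table layout (rows index X, columns index Y) are our own conventions, not from the original paper.

```python
import numpy as np

def goodman_kruskal_lambda(joint):
    """Goodman-Kruskal lambda; joint[x, y] = p(X=x, Y=y)."""
    rho_dot_m = joint.sum(axis=0).max()            # max_y p(Y=y)
    sum_rho_x_m = joint.max(axis=1).sum()          # sum_x max_y p(X=x, Y=y)
    return (sum_rho_x_m - rho_dot_m) / (1.0 - rho_dot_m)

def goodman_kruskal_tau(joint):
    """Goodman-Kruskal tau from the same joint probability table."""
    rho_x = joint.sum(axis=1)                      # p(X=x)
    rho_y = joint.sum(axis=0)                      # p(Y=y)
    omega = ((joint ** 2) / rho_x[:, None]).sum()  # sum_{x,y} p(x,y)^2 / p(x)
    ep_y = (rho_y ** 2).sum()                      # sum_y p(y)^2
    return (omega - ep_y) / (1.0 - ep_y)
```

For a perfectly associated table such as np.array([[0.5, 0.0], [0.0, 0.5]]) both measures equal 1, while for an independent table (an outer product of the two marginals) both equal 0.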
Reliability, which goes by "precision" in some publications, may be ambiguous in certain cases [17]. In our context, it refers to how much a probability model built upon a given nominal categorical data set may fail in predicting the unknowns; hence the number of classes of the independent variable, or the expected number, approximately reflects the model's reliability. The expected number of classes in a variable X can be measured by
Ep(X)=∑xρx⋅2 |
Roughly speaking, the more independent classes a predictive model has, the less support each conditional probability has in a data set of limited size, and hence the less reliable the constructed model is. However, this measure does not adequately account for the target variable's distribution. We believe it is more appropriate to construct one that considers the concentration of the independent values in each dependent class. This leads to our proposed measure of the reliability of explanatory information:
E(Gini(X|Y))=1−∑x∑yρxy2ρ⋅y=1−∑x∑yp(X=x,Y=y)p(X=x|Y=y), |
which is nothing but the average number of independent classes within each dependent class.
It is easy to see that the value of E(Gini(X|Y)) lies between 0 and 1, and that a smaller value indicates a more reliable model.
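E(Gini(X|Y)) can also be computed from a joint probability table; below is a minimal sketch (the function name and the row/column convention, rows for X and columns for Y, are ours):

```python
import numpy as np

def expected_gini_x_given_y(joint):
    """E(Gini(X|Y)) = 1 - sum_{x,y} p(x,y)^2 / p(y), with
    joint[x, y] = p(X=x, Y=y).

    Smaller values indicate that the independent classes are
    concentrated within each dependent class, i.e., a more
    reliable model.
    """
    rho_y = joint.sum(axis=0)                        # p(Y=y)
    return 1.0 - ((joint ** 2) / rho_y[None, :]).sum()
```

When each dependent class pins down a single independent class, as in np.array([[0.5, 0.0], [0.0, 0.5]]), the measure is 0; for a uniform 2x2 table it is 0.5.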
To decide which independent classes to merge, a category-to-variable measure is required to estimate each independent class's overall predictive power for the target variable. Applying this new measure to all pairs of classes in Dmn(X) yields the matrix
Φ(Y|X)=(ϕst(Y|X)), |
where
ϕst(Y|X)=∑y(ρsyρs⋅−ρtyρt⋅)2ρ⋅y;s,t∈Dmn(X). |
Thus, the (s, t)-entry in Φ(Y|X) measures the difference between the conditional distributions of Y given X=s and given X=t. The matrix has the following properties:
1. Φ(Y|X) is symmetric, i.e., ϕst(Y|X)=ϕts(Y|X).
2. The diagonal entries are zero.
3. The smaller ϕst(Y|X) is, the more similar the conditional distributions of Y given X=s and given X=t are, and thus the better candidates the classes s and t are for merging.
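The matrix Φ(Y|X) and the properties above can be verified numerically. The sketch below (function name and example table are ours) builds the full matrix from a joint probability table with rows indexing X and columns indexing Y:

```python
import numpy as np

def phi_matrix(joint):
    """phi[s, t] = sum_y (p(y|X=s) - p(y|X=t))^2 * p(y).

    Symmetric with a zero diagonal; a small off-diagonal entry marks a
    pair of X-classes that predict Y similarly, i.e., merge candidates.
    """
    rho_y = joint.sum(axis=0)                        # p(Y=y)
    cond = joint / joint.sum(axis=1, keepdims=True)  # p(y | X=x)
    diff = cond[:, None, :] - cond[None, :, :]       # pairwise differences
    return (diff ** 2 * rho_y).sum(axis=-1)
```

In the table np.array([[0.2, 0.2], [0.1, 0.1], [0.3, 0.1]]), rows 0 and 1 share the conditional distribution (0.5, 0.5), so their entry is exactly zero, while entries involving row 2 are positive.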
In addition, when we want to merge the classes of the combination of two independent variables, denoted as X1 and X2 with nx1 and nx2 classes respectively, the measure is applied to pairs of joint classes:
ϕijkl(Y|X1,X2)=∑y(ρijyρij⋅−ρklyρkl⋅)2ρ⋅y for i,k=1,2,⋯,nx1, and for j,l=1,2,⋯,nx2. |
Thus, what remains is to examine how the measure of association behaves after the merge.
The proportional prediction based association measure τ can be rewritten as
τ=ωY|X−Ep(Y)1−Ep(Y), |
where
ωY|X=∑x∑yρxy2ρx⋅ and Ep(Y)=∑yρ⋅y2. |
Thus, since Ep(Y) depends only on the marginal distribution of Y, which merging classes of X does not change, τ is unchanged whenever ωY|X is unchanged.
We also have the following theorem to explain why merging the nominal classes works.
Theorem 3.1. If the conditional probabilities of Y given two independent classes s and t of X are equal, i.e.,
ρsyρs.=ρtyρt.=ay, for y=1,2,…,ny, |
then merging the classes s and t into a single class m produces a new variable X′ with
ωY|X=ωY|X′. |
Proof. Let X′ be the variable obtained from X by merging the classes s and t into a new class m. Then
ωY|X′=∑x≠s,t∑yρ2xyρx⋅+∑yρ2myρm⋅, |
where ρmy=ρsy+ρty and ρm⋅=ρs⋅+ρt⋅.
Because
∑yρ2myρm⋅=∑y(ρsy+ρty)2ρs⋅+ρt⋅=∑y(ayρs⋅+ayρt⋅)2ρs⋅+ρt⋅=∑ya2y(ρs⋅+ρt⋅)=∑ya2yρs⋅+∑ya2yρt⋅=∑yρ2syρ2s⋅ρs⋅+∑yρ2tyρ2t⋅ρt⋅=∑yρ2syρs⋅+∑yρ2tyρt⋅, |
we have
∑x≠s,t∑yρ2xyρx⋅+∑yρ2myρm⋅=∑x≠s,t∑yρ2xyρx⋅+∑y(ρ2syρs⋅+ρ2tyρt⋅), |
that is
ωY|X=ωY|X′. |
Thus, the association measure τ also remains unchanged after the merge, which completes the proof.
On the other hand, when the conditional probabilities of Y given the classes to be merged are close but not identical, the change in ωY|X, and hence in τ, caused by the merge is small. Merging classes with small ϕst values therefore trades a slight loss of association for a gain in reliability.
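Theorem 3.1 is easy to confirm numerically. In the sketch below (the joint table is invented for illustration), rows 0 and 1 of X have identical conditional distributions of Y, so merging them leaves τ unchanged:

```python
import numpy as np

def tau(joint):
    """Goodman-Kruskal tau of Y on X; joint[x, y] = p(X=x, Y=y)."""
    rho_x = joint.sum(axis=1)
    rho_y = joint.sum(axis=0)
    omega = ((joint ** 2) / rho_x[:, None]).sum()
    ep_y = (rho_y ** 2).sum()
    return (omega - ep_y) / (1.0 - ep_y)

# Rows 0 and 1 both have conditional distribution (0.5, 0.5) over Y,
# so Theorem 3.1 says they can be merged without losing association.
joint = np.array([[0.20, 0.20],
                  [0.10, 0.10],
                  [0.30, 0.10]])
merged = np.vstack([joint[0] + joint[1], joint[2]])  # merge rows 0 and 1

print(tau(joint), tau(merged))  # equal, as the theorem predicts
```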
Both experiments use the 1996 Survey of Family Expenditures administered by Statistics Canada [16].
The first result shows how the reliability and association degrees change when Sex is added to Age group, with Occupation as the target variable. The result briefly demonstrates how a regular feature selection process works without merging. It also serves as the baseline for evaluating the performance after the merge.
As discussed above, the added variable Sex increases the association, measured by both τ and λ, but also increases E(Gini(X|Y)), that is, it reduces the reliability.
Knowing that Age group has
Variable | τ | λ | E(Gini(X|Y)) |
Age group'+Sex | 0.1484 | 0.0375 | 0.6688 |
(Age group'+Sex)'+Education' | 0.1542 | 0.0447 | 0.6620 |
Table 2 tells us that the merged Age group, combined with Sex, is better in reliability given the smaller E(Gini(X|Y)) value, at only a slight cost in association.
Variable | τ | λ | E(Gini(X|Y)) |
Age group | 0.1344 | 0.0311 | 0.8773 |
Age group + Sex | 0.1511 | 0.0476 | 0.9228 |
It is clear that the merging threshold determines how many classes are merged and therefore affects the quality of the result. Table 3 shows a simple analysis.
Variable | Threshold | λ | τ | E(Gini(X|Y)) |
Age group | - | 0.0311 | 0.1344 | 0.8773 |
| 0.0005 | 0.0414 | 0.1493 | 0.9222 |
| 0.0030 | 0.0375 | 0.1484 | 0.6688 |
| 0.0100 | 0.0000 | 0.0209 | 0.2710 |
As Table 3 suggests, the bigger the merging threshold is, the more classes are merged, hence the higher the reliability and the lower the association. One can tune this parameter to achieve the desired trade-off. The thresholds chosen in this article come from practical considerations rather than theoretical optima.
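The threshold-driven merging can be sketched as a greedy procedure: repeatedly merge the pair of independent classes with the smallest ϕ value until no pair falls below the threshold. The paper does not spell out the exact merging order, so the following is one plausible reading rather than the authors' algorithm; the function name and example table are ours.

```python
import numpy as np

def merge_below_threshold(joint, threshold):
    """Greedily merge X-classes (rows of the joint probability table)
    whose pairwise phi value falls below the threshold."""
    joint = joint.copy()
    while joint.shape[0] > 1:
        rho_y = joint.sum(axis=0)
        cond = joint / joint.sum(axis=1, keepdims=True)  # p(y | x)
        diff = cond[:, None, :] - cond[None, :, :]
        phi = (diff ** 2 * rho_y).sum(axis=-1)
        np.fill_diagonal(phi, np.inf)        # skip the zero diagonal
        s, t = np.unravel_index(phi.argmin(), phi.shape)
        if phi[s, t] > threshold:            # no pair similar enough
            break
        joint[s] += joint[t]                 # merge class t into class s
        joint = np.delete(joint, t, axis=0)
    return joint
```

With a tiny threshold only classes with (nearly) identical conditional distributions are merged; a large threshold collapses many classes, trading association for reliability, mirroring the pattern in Table 3.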
Following steps similar to those in the previous section but with different variable sets, we consider House type as the target variable and investigate the effect of merging. Please note that the threshold is still 0.0030.
Table 4 also shows an example where combining two merged variables yields not only better reliability but also higher association.
Variable | τ | λ | E(Gini(X|Y)) |
Rooms | 0.3443598 | 0.3004656 | 0.8200656 |
| 0.4255117 | 0.3583277 | 0.7911177 |
| 0.4381247 | 0.3901767 | 0.7165204 |
Based on the theory of association measures and the Gini coefficient, we take E(Gini(X|Y)) as the measure of the reliability of explanatory information, and recommend merging independent classes with similar conditional probabilities before adding new variables, so that statistical reliability can be increased without losing predictive power.
[1] H. L. Costner, Criteria for measures of association, American Sociological Review, 30 (1965), 341-353.
[2] M. Dash and H. Liu, Feature selection for classification, Intell. Data Anal., 1 (1997), 131-156.
[3] R. L. Ebel, Estimation of the reliability of ratings, Psychometrika, 16 (1951), 407-424.
[4] G. S. Fishman, Monte Carlo: Concepts, Algorithms, and Applications, Springer-Verlag, 1996.
[5] P. Glasserman, Monte Carlo Methods in Financial Engineering (Stochastic Modelling and Applied Probability, Vol. 53), Springer, 2004.
[6] L. A. Goodman and W. H. Kruskal, Measures of Association for Cross Classifications, with a foreword by Stephen E. Fienberg, Springer Series in Statistics, Springer-Verlag, New York-Berlin, 1979.
[7] L. Guttman, The test-retest reliability of qualitative data, Psychometrika, 11 (1946), 81-95.
[8] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., 3 (2003), 1157-1182.
[9] W. Huang and Y. Pan, On balancing between optimal and proportional categorical predictions, Big Data and Info. Anal., 1 (2016), 129-137.
[10] W. Huang, Y. Pan and J. Wu, Supervised discretization with GK-τ, Proc. Comp. Sci., 17 (2013), 114-120.
[11] W. Huang, Y. Pan and J. Wu, Supervised discretization for optimal prediction, Proc. Comp. Sci., 30 (2014), 75-80.
[12] W. Huang, Y. Shi and X. Wang, A nominal association matrix with feature selection for categorical data, Communications in Statistics - Theory and Methods, 2017.
[13] M. G. Kendall, The Advanced Theory of Statistics, Charles Griffin and Co., London, 1946.
[14] C. J. Lloyd, Statistical Analysis of Categorical Data, John Wiley & Sons, 1999.
[15] K. Pearson and D. Heron, On theories of association, Biometrika, 9 (1913), 159-315.
[16] STATCAN, Survey of Family Expenditures - 1996, 1998.
[17] D. L. Streiner and G. R. Norman, "Precision" and "accuracy": Two terms that are neither, J. Clin. Epid., 59 (2006), 327-330.