1. Introduction
Modern systems are capable of generating huge volumes of data continuously, which naturally form data streams. Data streams are by nature complex and difficult to analyse and mine efficiently for knowledge. When the classes in a data stream are imbalanced, the knowledge discovery process becomes an extraordinarily hard task, and the existing knowledge discovery approaches for data stream mining become a performance bottleneck. In recent years, considerable research effort has been expended on identifying and solving these problems, leading to the development of many effective data stream learners.
However, the main focus of this work is learning from data streams where the class balance is heavily skewed, or where it is unrealistic to expect more than a small fraction of the stream to be balanced. In such circumstances a new set of challenges arises that undermines the effectiveness of existing approaches. The recently proposed clustering approaches of [10,6,16,8] also face the same challenges with skewed data distributions in data streams.
The research community has contributed effective algorithmic approaches for imbalanced data stream learning, yet definite, practical and well-thought-out approaches are still required to solve the problem of data stream learning in real-world applications. In this work, we extend the core of recent approaches for solving the problem of skewed data stream learning; a concrete understanding and active exploration of these areas can significantly advance the development of technology for real-world incremental learning scenarios. In this paper, the empirical results of the proposed Incremental Under Sampling for Data Streams (IUSDS) algorithm are presented with a solid theoretical background.
The rest of this paper is organized as follows: Section 2 presents recent work on data streams. Section 3 presents the main framework of the proposed IUSDS algorithm. Section 4 provides a detailed explanation of the datasets and the evaluation criteria used in the experiments. Section 5 presents the algorithms used for comparison and the experimental setup. Section 6 presents the detailed experimental results and discussion. Section 7 draws conclusions and points out future research.
2. Related work
In this section, we first review the major research on classification in data stream learning.
Iain Brown et al. [3] have compared several techniques that can be used in the analysis of imbalanced credit scoring data sets. They progressively increase the class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the techniques is adversely affected. Ana C. Lorena et al. [13] have investigated the use of different supervised machine learning techniques to model the potentially imbalanced class distribution of 35 plant species from Latin America. Victoria López et al. [12] proposed an evolutionary framework that uses Iterative Instance Adjustment for Imbalanced Domains. Their method iteratively learns the appropriate number of examples that represent the classes and their particular positioning. The learning process contains three key operations in its design: a customized initialization procedure, an evolutionary optimization of the positioning of the examples, and a selection of the most representative examples for each class. Nele Verbiest et al. [17] proposed an improved synthetic minority oversampling technique (SMOTE) for data with class noise. Their approach cleans the data before applying SMOTE, so that the quality of the generated instances is better, and cleans the data after applying SMOTE, so that instances, whether original or introduced by SMOTE, that fit badly in the new dataset are also removed.
Peng Cao et al. [4] proposed an effective wrapper approach that incorporates the evaluation measure directly into the objective function of a cost-sensitive neural network to improve classification performance, using Particle Swarm Optimization to simultaneously optimize the best combination of feature subset, intrinsic structure parameters and misclassification costs. Yetian Chen [5] has reported two classification tasks based on data from a scientific experiment. The first is a binary classification task that maximizes classification accuracy on an evenly-distributed test data set, given a fully labeled imbalanced training data set. The second is also a binary classification task, but maximizes the F1-score of classification on a test data set, given a partially labeled training set. Doucette et al. [7] have proposed a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class-balance-enforcing rule for fitness evaluation. Aditya Krishna Menon et al. [15] have studied consistency with respect to one such performance measure, namely the arithmetic mean of the true positive and true negative rates, and established that some practically popular approaches, such as applying an empirically determined threshold to a suitable class probability estimate or performing an empirically balanced form of risk minimization, are in fact consistent with respect to this measure under mild conditions on the underlying distribution.
Shuo Wang et al. [18] have defined class imbalance online and proposed two learning algorithms, OOB and UOB, that build an ensemble model overcoming class imbalance in real time through re-sampling and time-decayed metrics. In further work they improved the re-sampling strategy inside OOB and UOB and examined their performance in both static and dynamic data streams, finding that UOB is better at recognizing minority-class examples in static data streams, while OOB is more robust against dynamic changes in class imbalance status. They subsequently proposed a multi-objective ensemble method, MOSOB, that combines OOB and UOB. MOSOB finds the Pareto-optimal weights for OOB and UOB at each time step to maximize minority-class recall and majority-class recall simultaneously, and they concluded that MOSOB performs well in both static and dynamic data streams.
Bing Yang et al. [20] have paid close attention to the uniqueness of uneven data distribution in imbalanced classification problems. Without changing the original imbalanced training data, they indicated the advantages of proximal classifiers for imbalanced data classification. In order to improve classification accuracy, they proposed a new model named LSNPPC, based on the classical proximal SVM models, which finds two nonparallel planes for data classification. Geoff Hulten et al. [9] have proposed a new version of the multi-class Weighted Support Vector Machines (WSVM) method to perform automatic recognition of activities in a smart home environment. WSVM is capable of solving the class imbalance problem by improving the class accuracy of activity classification compared to other methods such as CRF, k-NN and SVM.
3. Proposed Incremental Under Sampling for Data Streams (IUSDS) algorithm
This section presents the detailed architecture of the proposed IUSDS approach, which consists of four major modules; its working principles are explained in the sub-sections below. In the initial phase of the proposed IUSDS, the dataset is split into two subsets, known as the majority and minority subsets. The majority subset contains the instances of the class that holds the larger share of the dataset, while the minority subset contains the instances of the class that is far smaller in comparison. Since the proposed approach is an under-sampling approach, we consider the majority subset for more elaborate analysis aimed at reducing its instances.
The instances in the majority subset are reduced by three techniques: the first eliminates the noise instances, the second finds the outliers, and the third finds the ranges of weak instances for removal.
The noisy and outlier instances can be easily identified by analyzing the intrinsic properties of the instances. The range of weak instances can be identified by first identifying the weak features in the majority subset. The correlation based feature selection [14] technique selects the important features by considering the inter-correlation between features as well as the correlation between each feature and the class.
The features with very low correlation are identified for elimination, and the ranges of instances belonging to these weak features are marked for removal from the majority subset. The number of features and instances eliminated by the correlation based feature selection technique varies from dataset to dataset, depending on the unique properties of each dataset.
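As an illustration of the weak-feature step, the sketch below flags features whose absolute correlation with the class falls below a threshold. This is a deliberately simplified stand-in for correlation based feature selection [14], which also weighs feature-feature inter-correlation; the `threshold` value and function names are assumptions, not part of the original algorithm.

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def weak_features(X, y, threshold=0.1):
    """Return indices of features whose |correlation| with the class labels
    is below the threshold -- the 'weak' features whose instance ranges
    are candidates for elimination from the majority subset."""
    n_features = len(X[0])
    weak = []
    for j in range(n_features):
        col = [row[j] for row in X]
        if abs(pearson(col, y)) < threshold:
            weak.append(j)
    return weak
```

Here feature-class correlation alone decides weakness; a fuller CFS-style merit score would also penalize features that are redundant with each other.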
The detailed procedure of IUSDS is given in the form of algorithm as follows.
In the concluding phase of the algorithm, the majority subset from which irrelevant instances have been removed is merged with the minority subset to form the strong dataset, which is then passed to the base algorithm for experimental simulation. In this work, random forest [11] is used as the base algorithm for experimental simulation and results generation.
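A minimal sketch of one IUSDS-style pass is given below, covering only the noise/outlier elimination on the majority subset and the final merge with the minority subset. The per-feature z-score test and its `z_thresh` parameter are assumptions made for illustration; the paper's own algorithm identifies noise and outliers from the intrinsic properties of the instances and additionally removes the ranges of weak instances.

```python
import statistics

def iusds_reduce(majority, minority, z_thresh=3.0):
    """One illustrative under-sampling pass: drop majority instances that
    look like outliers on any feature, then merge with the minority subset.

    `majority` and `minority` are lists of numeric feature vectors.
    """
    n_features = len(majority[0])
    keep = [True] * len(majority)
    # Flag outliers per feature via a z-score test (an assumed criterion).
    for j in range(n_features):
        col = [row[j] for row in majority]
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col)
        if sd == 0:
            continue  # constant feature: nothing to flag
        for i, v in enumerate(col):
            if abs(v - mu) / sd > z_thresh:
                keep[i] = False
    reduced = [row for i, row in enumerate(majority) if keep[i]]
    # Merge the reduced majority subset with the untouched minority subset
    # to form the "strong" dataset handed to the base learner.
    return reduced + minority
```

The resulting merged list would then be fed to the base classifier (random forest in the paper's setup).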
4. Data sets and evaluation criteria
In this study, 15 binary data sets are collected from the UCI Repository of Machine Learning Databases (School of Information and Computer Science, University of California, Irvine, CA) [1] and from KEEL [2], one of the largest machine learning repositories for the research community. The datasets span varied application domains and are used to form an active data stream with different imbalance ratios.
The individual arriving streams of data range from 57 instances for the labor dataset to 3772 instances for the sick dataset. The Imbalance Ratio (IR) of the individual arriving streams ranges from 1.08 for the mushroom dataset to 15.32 for the sick dataset. The different properties of the datasets, such as the number of majority instances and the number of minority instances, are given in Table 1.
Table 2 presents the detailed description of the data stream formed over time t. At the initial time t, chunk 1 of B arrives with a total of 286 instances and an imbalance ratio of 2.36. At time t+1 another chunk arrives; the total becomes 985 instances and the imbalance ratio changes to 2.02. In the same way, as each data stream chunk arrives, the total number of instances to be processed and the imbalance ratio vary. When all fifteen chunks have arrived, the total reaches 19851 instances with an imbalance ratio of 1.63.
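The arithmetic behind Table 2 can be reproduced by accumulating per-chunk class counts, with the cumulative IR computed as cumulative majority count over cumulative minority count. In the sketch below, the first two chunks are assumed to carry the standard UCI class counts of breast-cancer (201 majority, 85 minority) and breast-w (458 majority, 241 minority); treating these as the first two chunks is an assumption, but their totals reproduce the figures quoted above.

```python
def cumulative_imbalance(chunks):
    """Track total stream size and imbalance ratio as chunks accumulate.

    Each chunk is a (majority_count, minority_count) pair; IR is the
    cumulative majority count divided by the cumulative minority count.
    """
    maj = mino = 0
    history = []
    for cmaj, cmin in chunks:
        maj += cmaj
        mino += cmin
        history.append((maj + mino, round(maj / mino, 2)))
    return history

# Assumed first two chunks: breast-cancer (201/85) and breast-w (458/241).
print(cumulative_imbalance([(201, 85), (458, 241)]))
# -> [(286, 2.36), (985, 2.02)]
```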
The evaluation metrics used in this paper are detailed below. The percentage of instances correctly classified by a classifier is known as Accuracy. When only a single run of the classification algorithm is available, AUC can be computed simply as the mean of the true positive rate and the true negative rate:

AUC = (TP Rate + TN Rate) / 2

or

AUC = (1/2) * (TP / (TP + FN) + TN / (TN + FP))

The Precision measure is computed by

Precision = TP / (TP + FP)

The Recall measure is computed by

Recall = TP / (TP + FN)

The F-measure value is computed by

F-measure = (2 * Precision * Recall) / (Precision + Recall)

The True Positive Rate [9] measure is computed by

TP Rate = TP / (TP + FN)

The True Negative Rate measure is computed by

TN Rate = TN / (TN + FP)
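All of these measures can be computed together from a single binary confusion matrix, as in the following sketch (function and dictionary key names are illustrative):

```python
def stream_metrics(tp, fn, fp, tn):
    """Compute the paper's evaluation measures from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tp_rate = tp / (tp + fn)          # recall / sensitivity
    tn_rate = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = 2 * precision * recall / (precision + recall)
    auc = (tp_rate + tn_rate) / 2     # single-run AUC: mean of TPR and TNR
    return {"accuracy": accuracy, "auc": auc, "precision": precision,
            "recall": recall, "f_measure": f_measure,
            "tp_rate": tp_rate, "tn_rate": tn_rate}
```

Note that the single-run AUC here is exactly the mean of TP Rate and TN Rate defined above, not the area under a full ROC curve.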
5. Experimental results
In this section, we present the experimental results of the proposed approach against two benchmark comparative algorithms, C4.5 and Hoeffding Tree. The experimental results of the comparative algorithms and the proposed IUSDS are implemented and generated in the Weka environment [19]. Two of the best-known algorithms are chosen for comparison so as to expose the strengths and weaknesses of the proposed approach on the different evaluation criteria.
Table 3 presents the average Accuracy results for the proposed IUSDS algorithm and the compared algorithms. We compared two benchmark algorithms with our proposal; the first is the classical decision tree algorithm C4.5. The results of IUSDS are far better than those of C4.5: out of all 15 incremental data stream chunks, the proposed IUSDS algorithm won on all 15. The second algorithm compared with IUSDS is Hoeffding Tree, one of the recent data stream learning algorithms. Compared with Hoeffding Tree, the proposed IUSDS algorithm also won on all 15 incremental data stream chunks in terms of accuracy.
Table 4 presents the average AUC results for the proposed IUSDS algorithm and the compared algorithms. The results of IUSDS are efficient compared with C4.5: out of all 15 incremental data stream chunks, the proposed IUSDS algorithm won on all 15. Compared with Hoeffding Tree, the proposed IUSDS algorithm achieved 6 wins, 2 ties and 7 losses out of the 15 incremental data stream chunks. The trends of accuracy, TP rate and TN rate are shown in Figures 1 and 2, respectively. These trends also show that the proposed IUSDS algorithm performed well against the compared algorithms.
Tables 5, 6, 7, 8, 9 and 10 present the average precision, recall, F-measure, TP Rate, FP Rate and TN Rate results, respectively, for the proposed IUSDS algorithm and the compared algorithms.
The results of IUSDS are efficient compared with C4.5 and Hoeffding Tree on all 15 incremental data stream chunks. The summary of wins, ties and losses is given in Table 11. The proposed IUSDS approach is effective in solving the real-time issues of knowledge discovery from imbalanced and noisy data streams, and the experimental results clearly indicate that IUSDS performs better on data streams than the classical approaches.
6. Conclusion
In this paper, the Incremental Under Sampling for Data Streams (IUSDS) algorithm is proposed, using an efficient and unique under-sampling strategy. The IUSDS approach under-samples the majority subset while preserving the prominent, class-oriented instances, for better knowledge discovery from data streams. The experimental results suggest that the IUSDS approach efficiently discovers knowledge from imbalanced data streams. In future work, we intend to address the problem of concept drift in imbalanced data streams.