A novel approach using incremental under sampling for data stream mining

  • Published: 01 November 2017
  • Primary: 68T30; Secondary: 68T05

  • Data stream mining has become very popular in recent years as advanced electronic devices generate continuous data streams. The performance of standard learning algorithms is compromised by the imbalanced nature of real-world data streams. In this paper, we propose an algorithm known as Incremental Under Sampling for Data Streams (IUSDS), which uses a unique under-sampling technique to almost balance the data sets and minimize the effect of imbalance on the stream mining process. The experimental analysis suggests that the proposed algorithm improves knowledge discovery over benchmark algorithms such as C4.5 and Hoeffding tree in terms of standard performance measures, namely accuracy, AUC, precision, recall, F-measure, TP rate, FP rate and TN rate.

    Citation: Anupama N, Sudarson Jena. A novel approach using incremental under sampling for data stream mining[J]. Big Data and Information Analytics, 2018, 3(1): 1-13. doi: 10.3934/bdia.2017017




    Modern systems are capable of continuously generating huge volumes of data, which naturally form data streams. Data streams are by nature complex, and it is difficult to analyse them and mine knowledge from them efficiently. If the classes in a data stream are imbalanced, the process of knowledge discovery becomes an especially demanding task. Existing knowledge discovery approaches for data stream mining are bottlenecked in terms of performance. In recent years, considerable research effort has been expended towards identifying and solving these problems, leading to the development of many effective data stream learners.

    The main focus of this work is learning from data streams where the class balance is heavily skewed, or where it is unrealistic to expect more than a small fraction of the stream to be balanced; in such circumstances a new set of challenges arises that undermines the effectiveness of existing approaches. The recent clustering proposals of [10,6,16,8] also face the same challenges with skewed data distributions in data streams.

    The research community has contributed effective algorithmic approaches for imbalanced data stream learning. Nevertheless, practical and well-thought-out approaches, backed by forward-looking technology, are still required to solve the problem of data stream learning in real-world applications. In this work, we extend the core of recent approaches for solving the problem of skewed data stream learning. A concrete understanding and active exploration of these areas can significantly advance the development of technology for real-world incremental learning scenarios. In this paper, the empirical results of Incremental Under Sampling for Data Streams (IUSDS) are presented with a solid theoretical background.

    The rest of this paper is organized as follows: Section 2 presents recent works on data streams. Section 3 presents the main framework of the proposed IUSDS algorithm. Section 4 provides a detailed explanation of the datasets and the evaluation criteria used in the experiments. Section 5 presents the algorithms used for comparison and the experimental setup. Section 6 presents the detailed experimental results and discussion. Section 7 draws conclusions and points out future research.

    In this section, we review the major research on data stream learning and classification.

    Iain Brown et al. [3] compared several techniques that can be used in the analysis of imbalanced credit scoring data sets. They progressively increased class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the techniques is adversely affected. Ana C. Lorena et al. [13] investigated the use of different supervised machine learning techniques to model the potentially imbalanced class distribution of 35 plant species from Latin America. Victoria López et al. [12] proposed an evolutionary framework that uses Iterative Instance Adjustment for Imbalanced Domains. Their method iteratively learns the appropriate number of examples that represent the classes and their particular positioning. The learning process contains three key operations in its design: a customized initialization procedure, an evolutionary optimization of the positioning of the examples, and a selection of the most representative examples for each class. Nele Verbiest et al. [17] proposed an improved synthetic minority oversampling technique (SMOTE) for use in the presence of class noise. Their approach cleans the data before applying SMOTE, so that the quality of the generated instances is better, and cleans the data after applying SMOTE, so that instances, original or introduced by SMOTE, that badly fit the new dataset are also removed.

    Peng Cao et al. [4] proposed an effective wrapper approach that incorporates the evaluation measure directly into the objective function of a cost-sensitive neural network to improve classification performance, simultaneously optimizing, via Particle Swarm Optimization, the best combination of feature subset, intrinsic structure parameters and misclassification costs. Yetian Chen [5] reported two classification tasks based on data from a scientific experiment. The first is a binary classification task that maximizes classification accuracy on an evenly distributed test data set, given a fully labeled imbalanced training data set. The second is also a binary classification task, but maximizes the F1-score of classification on a test data set, given a partially labeled training set. Doucette et al. [7] proposed a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class-balance-enforcing rule for fitness evaluation. Aditya Krishna Menon et al. [15] studied consistency with respect to one such performance measure, namely the arithmetic mean of the true positive and true negative rates, and established that some practically popular approaches, such as applying an empirically determined threshold to a suitable class probability estimate or performing an empirically balanced form of risk minimization, are in fact consistent with respect to this measure under mild conditions on the underlying distribution.

    Shuo Wang et al. [18] defined class imbalance online and proposed two learning algorithms, OOB and UOB, that build an ensemble model overcoming class imbalance in real time through re-sampling and time-decayed metrics. In further work they improved the re-sampling strategy inside OOB and UOB and examined their performance in both static and dynamic data streams. They found that UOB is better at recognizing minority-class examples in static data streams, while OOB is more robust against dynamic changes in class imbalance status. They subsequently proposed a multi-objective ensemble method, MOSOB, that combines OOB and UOB. MOSOB finds the Pareto-optimal weights for OOB and UOB at each time step to maximize minority-class recall and majority-class recall simultaneously. They concluded that MOSOB performs well in both static and dynamic data streams.

    Bing Yang et al. [20] paid close attention to the uniqueness of uneven data distribution in imbalanced classification problems. Without changing the original imbalanced training data, they demonstrated the advantages of proximal classifiers for imbalanced data classification. To improve classification accuracy, they proposed a new model named LSNPPC, based on the classical proximal SVM models, which finds two nonparallel planes for data classification. Geoff Hulten et al. [9] proposed a new version of the multi-class Weighted Support Vector Machines (WSVM) method to perform automatic recognition of activities in a smart home environment. WSVM is capable of addressing the class imbalance problem by improving the class accuracy of activity classification compared to other methods such as CRF, k-NN and SVM.

    This section presents the detailed architecture of the proposed IUSDS approach, which consists of four major modules. The detailed working principles of the IUSDS approach are explained in the sub-sections below. In the initial phase of the proposed IUSDS, the dataset is split into two subsets known as the majority and minority subsets. The majority subset is the class of instances that form a larger percentage than the other class. The minority subset is the class of instances that are far fewer when compared to the other class in the dataset. Since the proposed approach is an under-sampling approach, we consider the majority subset for more elaborate analysis and instance reduction.
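    As a minimal illustration of this initial split (the function name and the binary 0/1 label-in-last-column convention are our assumptions, not from the paper), the two subsets can be formed by class frequency:

```python
import numpy as np

def split_by_class(data):
    """Split a binary-labelled dataset (label in the last column)
    into majority and minority subsets by class frequency."""
    labels = data[:, -1]
    classes, counts = np.unique(labels, return_counts=True)
    maj_label = classes[np.argmax(counts)]
    majority = data[labels == maj_label]
    minority = data[labels != maj_label]
    return majority, minority

# Toy chunk: six class-0 instances versus two class-1 instances
chunk = np.array([[x, 0.0] for x in range(6)] + [[9.0, 1.0], [8.0, 1.0]])
majority, minority = split_by_class(chunk)
```

    The same split is applied to every incoming chunk before any instance reduction takes place.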

    The instances in the majority subset are reduced by the following techniques: one technique eliminates the noisy instances, another finds the outliers, and the final technique finds the ranges of weak instances for removal.

    The noisy and outlier instances can be identified by analyzing the intrinsic properties of the instances. The range of weak instances can be identified by first identifying the weak features in the majority subset. The correlation-based feature selection (CFS) [14] technique selects the important features according to the correlation between features and the correlation between each feature and the class.

    Features with very low correlation are identified for elimination. The ranges of instances that belong to these weak features are identified for elimination from the majority subset. The number of features and instances eliminated by the correlation-based feature selection technique varies from dataset to dataset, depending on the unique properties of the dataset.
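    The paper relies on CFS [14] for this step; as a simplified stand-in, the sketch below scores each feature by its absolute Pearson correlation with the class and flags low-scoring features as weak (the threshold of 0.1 is our illustrative choice, not a value from the paper):

```python
import numpy as np

def weak_features(X, y, threshold=0.1):
    """Return indices of features whose absolute Pearson correlation
    with the class label falls below `threshold` -- a simplified
    proxy for CFS-style feature relevance, not the full CFS merit."""
    weak = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.std(col) == 0:          # constant features carry no signal
            weak.append(j)
            continue
        r = np.corrcoef(col, y)[0, 1]
        if abs(r) < threshold:
            weak.append(j)
    return weak

y = np.repeat([0.0, 1.0], 50)             # 50 negatives, 50 positives
strong = y + 0.01 * np.arange(100)        # tracks the class closely
noise = np.tile([1.0, -1.0], 50)          # exactly uncorrelated with y
X = np.column_stack([strong, noise])
print(weak_features(X, y))                # prints [1]: the noise feature
```

    The value ranges covered by the flagged features would then be marked for removal from the majority subset.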

    The detailed procedure of IUSDS is given in the form of algorithm as follows.

    ALGORITHM 1. IUSDS
    Input: S: data stream of examples partitioned into chunks,
    P: chunks of instances of the minority class,
    N: chunks of instances of the majority class,
    |P| < |N|, and Fj, the feature set, j > 0.
    Output: Average measures {AUC, Precision, F-Measure, TP Rate, TN Rate}
    Procedure:
    Processing Phase:
    Step 1. Take the class-imbalanced data and divide it into majority and minority subsets. Let the minority subset be P = {p_i} (i = 1, 2, ..., pnum) and the majority subset be N = {n_i} (i = 1, 2, ..., nnum).
    Selection Phase:
    Step 1: begin
    Step 2: k ← 0, j ← 1
    Step 3: Apply CFS on subset N
    Step 4: Find Fj from N; k ← number of features extracted by CFS
    Step 5: repeat
    Step 6: j ← j + 1
    Step 7: Select the range of weak or noisy instances of Fj
    Step 8: Remove the ranges of weak attributes and form a set of major-class examples Nstrong
    Step 9: until j = k
    Step 10: Form a new dataset using P and Nstrong
    Step 11: end
    Building Predictive Model:
    1. Create a node N
    2. If the samples in N are of the same class C then
    3. return N as a leaf node and mark it with class C;
    4. If the attribute list A is empty then
    5. return N as a leaf node and mark it with the majority class;
    6. else
    7. apply Random Forest
    8. end if
    9. end if
    10. Return N

    In the concluding phase of the algorithm, the reduced majority subset, from which the irrelevant instances have been removed, is merged with the minority subset to form the strong dataset, which is then supplied to the base algorithm for experimental simulation. In this context, random forest [11] is used as the base algorithm for experimental simulation and results generation.
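    A minimal end-to-end sketch of the under-sampling phase follows; here a simple z-score rule (|z| > 3) stands in for the paper's noise/outlier/weak-range analysis, and the resulting "strong" dataset would then be handed to the random forest base learner (not shown):

```python
import numpy as np

def undersample_majority(majority, z_thresh=3.0):
    """Drop majority instances that look noisy or outlying: any feature
    value more than `z_thresh` standard deviations from the feature mean.
    (A stand-in for the paper's noise/outlier/weak-range analysis.)"""
    X = majority[:, :-1]                     # exclude the label column
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant features
    z = np.abs((X - mu) / sigma)
    keep = (z <= z_thresh).all(axis=1)
    return majority[keep]

def iusds_chunk(majority, minority):
    """Merge the reduced majority subset with the minority subset
    to form the almost-balanced 'strong' dataset."""
    return np.vstack([undersample_majority(majority), minority])

rng = np.random.default_rng(1)
majority = np.column_stack([rng.normal(size=200), np.zeros(200)])
majority[0, 0] = 50.0                        # plant one obvious outlier
minority = np.column_stack([rng.normal(3.0, 1.0, size=40), np.ones(40)])
strong = iusds_chunk(majority, minority)     # the outlier row is removed
```

    The z-score rule is only one plausible realization of the noise/outlier step; the key design point is that instances are removed from the majority side only, so no minority information is lost.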

    In this study, 15 binary data sets are collected from the UCI Repository of Machine Learning Databases (School of Information and Computer Science, University of California, Irvine) [2] and from KEEL [1], one of the largest machine learning repositories for the research community. The datasets come from varied application domains and together form the active data stream, with different imbalance ratios.

    The individual data streams range from 57 instances for the labor dataset to 3772 instances for the sick dataset. The Imbalance Ratio (IR) of the individual streams ranges from 1.08 for the mushroom dataset to 15.32 for the sick dataset. Properties of the datasets, such as the number of majority and minority instances, are given below in Table 1.

    Table 1.  Details of the Dataset.
    S.no Dataset symbol Instances Majority Minority IR
    1. Breast-cancer B1 286 201 85 2.36
    2. Breast-w B2 699 458 241 1.90
    3. Colic C1 368 232 136 1.71
    4. Credit-g C2 1,000 700 300 2.33
    5. Diabetes D1 768 500 268 1.87
    6. Heart-c H1 303 165 138 1.19
    7. Heart-h H2 294 188 106 1.77
    8. Heart-stat H3 270 150 120 1.25
    9. Hepatitis H4 155 123 32 3.85
    10. Ionosphere I1 351 225 126 1.79
    11. Kr-vs-kp K1 3196 1669 1527 1.09
    12. Labor L1 57 37 20 1.85
    13. Mushroom M1 8124 4208 3916 1.08
    14. Sick S1 3772 3541 231 15.32
    15. Sonar S2 208 111 97 1.15


    Table 2 presents a detailed description of the data stream formed over time t. At the initial time t, chunk 1 (B1) arrives, with a total of 286 instances and an imbalance ratio of 2.36. At time t+1 another chunk arrives, the total becomes 985 instances, and the imbalance ratio changes to 2.02. In the same way, further data stream chunks arrive, and the total number of instances to be processed and the imbalance ratio vary. When all fifteen chunks have arrived, the total reaches 19851 instances and the imbalance ratio is 1.63.
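    The growth of the stream can be traced by accumulating the Table 1 counts chunk by chunk; a short sketch for the first four chunks (the first two printed rows match Table 2):

```python
# (symbol, majority, minority) counts taken from Table 1
chunks = [("B1", 201, 85), ("B2", 458, 241), ("C1", 232, 136), ("C2", 700, 300)]

maj_total = min_total = 0
for symbol, maj, mino in chunks:
    maj_total += maj
    min_total += mino
    ir = maj_total / min_total               # cumulative imbalance ratio
    print(f"after {symbol}: {maj_total + min_total} instances, IR = {ir:.2f}")
```
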

    Table 2.  Data Stream Description.
    Dataset Instances Majority Minority IR
    Chunk 1:{B1} 286 201 85 2.36
    Chunk 2:{B1, B2} 985 659 326 2.02
    Chunk 3:{B1, B2, C1} 1353 891 462 1.92
    Chunk 4:{B1, B2, C1, C2} 2353 1591 1062 1.49
    Chunk 5:{B1, B2, C1, C2, D1} 3121 2091 1325 1.57
    Chunk 6:{B1, B2, C1, C2, D1, H1} 3424 2256 1463 1.52
    Chunk 7:{B1, B2, C1, C2, D1, H1, H2} 3718 2444 1569 1.55
    Chunk 8:{B1, B2, C1, C2, D1, H1, H2, H3} 3988 2594 1689 1.53
    Chunk 9:{B1, B2, C1, C2, D1, H1, H2, H3, H4} 4143 2717 1721 1.57
    Chunk 10:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1} 4494 2942 1847 1.59
    Chunk 11:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1} 7690 4611 3374 1.36
    Chunk 12:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1} 7747 4648 3394 1.36
    Chunk 13:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1} 15871 8856 7310 1.21
    Chunk 14:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1, S1} 19643 12397 7541 1.64
    Chunk 15:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1, S1, S2} 19851 12508 7638 1.63


    The evaluation metrics used in the paper are detailed below. The percentage of instances correctly classified by a classifier is known as Accuracy. When only a single run of the classifier is available, the AUC can be computed simply as the mean of the TP rate and TN rate. The formula for AUC is given below,

    AUC = (1 + TP Rate − FP Rate) / 2 (1)

    or

    AUC = (TP Rate + TN Rate) / 2. (2)

    The Precision measure is computed by

    Precision = TP / (TP + FP). (3)

    The Recall measure is computed by

    Recall = TP / (TP + FN). (4)

    The F-measure value is computed by

    F-measure = (2 × Precision × Recall) / (Precision + Recall). (5)

    The True Positive Rate [9] measure is computed by

    True Positive Rate = TP / (TP + FN). (6)

    The True Negative Rate measure is computed by

    True Negative Rate = TN / (TN + FP). (7)
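    These measures follow directly from the confusion-matrix counts; a small sketch (the TP/FP/TN/FN values are illustrative), noting that the two AUC forms (1) and (2) agree because FP Rate = 1 − TN Rate:

```python
def stream_metrics(tp, fp, tn, fn):
    """Compute the paper's evaluation measures from confusion counts."""
    tpr = tp / (tp + fn)                     # True Positive Rate = Recall  (4), (6)
    tnr = tn / (tn + fp)                     # True Negative Rate           (7)
    fpr = fp / (fp + tn)                     # False Positive Rate = 1 - TNR
    auc = (1 + tpr - fpr) / 2                # same value as (tpr + tnr) / 2
    precision = tp / (tp + fp)               # Precision                    (3)
    f_measure = 2 * precision * tpr / (precision + tpr)  # F-measure        (5)
    return {"TPR": tpr, "TNR": tnr, "AUC": auc,
            "Precision": precision, "Recall": tpr, "F-measure": f_measure}

m = stream_metrics(tp=8, fp=5, tn=5, fn=2)
# AUC = (0.8 + 0.5) / 2 = 0.65
```
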

    In this section, we present the experimental results of the proposed approach together with the two benchmark comparative algorithms, C4.5 and Hoeffding Tree. The results of the comparative algorithms and the proposed IUSDS are generated in the Weka environment [19]. Two of the best-known algorithms are chosen for comparison so as to expose the strengths and weaknesses of the proposed approach on different evaluation criteria.

    Table 3 presents the average accuracy results of our proposed IUSDS algorithm against the compared algorithms. We compared two benchmark algorithms with our proposal; the first is the classical decision tree algorithm C4.5. The results of the IUSDS algorithm are far better when compared with C4.5: out of all 15 incremental data stream chunks, our proposed IUSDS algorithm won on all 15. The second algorithm compared with IUSDS is the Hoeffding Tree, one of the most recent data stream learning algorithms. Compared with the Hoeffding Tree, our proposed IUSDS algorithm also won on all 15 incremental data stream chunks in terms of accuracy.

    Table 3.  Average TN Rate for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.260 0.395 0.325
    Chunk 2 (maj=659; min=326) 0.596 0.685 0.652
    Chunk 3 (maj=891; min=462) 0.636 0.693 0.689
    Chunk 4 (maj=1591; min=762) 0.577 0.642 0.624
    Chunk 5 (maj=2091; min=1030) 0.582 0.634 0.638
    Chunk 6 (maj=2214; min=1062) 0.635 0.674 0.685
    Chunk 7 (maj=2439; min=1188) 0.679 0.707 0.723
    Chunk 8 (maj=2476; min=1208) 0.702 0.738 0.740
    Chunk 9 (maj=6017; min=1438) 0.721 0.667 0.759
    Chunk 10 (maj=6128; min=1536) 0.724 0.657 0.757
    Chunk 11 (maj=6395; min=1704) 0.745 0.684 0.778
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS


    Table 4 presents the average AUC results of our proposed IUSDS algorithm against the compared algorithms. The results of the IUSDS algorithm are efficient when compared with C4.5: out of all 15 incremental data stream chunks, our proposed IUSDS algorithm won on all 15. Compared with the Hoeffding Tree, our proposed IUSDS algorithm achieved 6 wins, 2 ties and 7 losses out of 15 incremental data stream chunks. The trends of the TN rate and FP rate are shown in Figures 1 and 2 respectively. The trends in the figures also show that our proposed IUSDS algorithm performed well against the compared algorithms.

    Table 4.  Average Accuracy for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 74.28 72.18 71.73
    Chunk 2 (maj=659; min=326) 84.64 84.09 84.94
    Chunk 3 (maj=891; min=462) 84.81 81.90 84.03
    Chunk 4 (maj=1591; min=762) 81.42 80.19 79.10
    Chunk 5 (maj=2091; min=1030) 80.04 79.30 79.26
    Chunk 6 (maj=2214; min=1062) 79.90 79.86 79.59
    Chunk 7 (maj=2439; min=1188) 81.31 81.06 81.23
    Chunk 8 (maj=2476; min=1208) 80.97 82.15 81.01
    Chunk 9 (maj=6017; min=1438) 82.94 83.44 82.94
    Chunk 10 (maj=6128; min=1536) 82.01 81.92 82.17
    Chunk 11 (maj=6395; min=1704) 83.33 83.11 83.52
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Figure 1.  Trends in TN Rate for C4.5, Hoeffding Tree versus IUSDS on data stream.
    Figure 2.  Trends in FP Rate for C4.5, Hoeffding Tree versus IUSDS on data stream.

    Tables 5, 6, 7, 8, 9 and 10 present the results of average FP rate, AUC, precision, recall, F-measure and FN rate, respectively, for our proposed IUSDS algorithm against the compared algorithms.

    Table 5.  Average FP Rate for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.740 0.605 0.675
    Chunk 2 (maj=659; min=326) 0.404 0.315 0.348
    Chunk 3 (maj=891; min=462) 0.364 0.307 0.311
    Chunk 4 (maj=1591; min=762) 0.423 0.358 0.376
    Chunk 5 (maj=2091; min=1030) 0.418 0.366 0.362
    Chunk 6 (maj=2214; min=1062) 0.365 0.326 0.315
    Chunk 7 (maj=2439; min=1188) 0.321 0.293 0.277
    Chunk 8 (maj=2476; min=1208) 0.298 0.262 0.260
    Chunk 9 (maj=6017; min=1438) 0.279 0.333 0.241
    Chunk 10 (maj=6128; min=1536) 0.276 0.343 0.243
    Chunk 11 (maj=6395; min=1704) 0.255 0.316 0.222
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Table 6.  Average AUC for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.606 0.683 0.637
    Chunk 2 (maj=659; min=326) 0.782 0.836 0.812
    Chunk 3 (maj=891; min=462) 0.802 0.832 0.833
    Chunk 4 (maj=1591; min=762) 0.764 0.820 0.777
    Chunk 5 (maj=2091; min=1030) 0.761 0.818 0.787
    Chunk 6 (maj=2214; min=1062) 0.746 0.819 0.775
    Chunk 7 (maj=2439; min=1188) 0.766 0.836 0.795
    Chunk 8 (maj=2476; min=1208) 0.761 0.845 0.791
    Chunk 9 (maj=6017; min=1438) 0.782 0.813 0.810
    Chunk 10 (maj=6128; min=1536) 0.779 0.812 0.806
    Chunk 11 (maj=6395; min=1704) 0.798 0.826 0.821
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Table 7.  Average Precision for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.753 0.774 0.736
    Chunk 2 (maj=659; min=326) 0.859 0.881 0.861
    Chunk 3 (maj=891; min=462) 0.856 0.866 0.836
    Chunk 4 (maj=1591; min=762) 0.834 0.849 0.800
    Chunk 5 (maj=2091; min=1030) 0.827 0.839 0.802
    Chunk 6 (maj=2214; min=1062) 0.774 0.795 0.785
    Chunk 7 (maj=2439; min=1188) 0.791 0.802 0.804
    Chunk 8 (maj=2476; min=1208) 0.779 0.811 0.798
    Chunk 9 (maj=6017; min=1438) 0.803 0.826 0.819
    Chunk 10 (maj=6128; min=1536) 0.795 0.807 0.813
    Chunk 11 (maj=6395; min=1704) 0.811 0.822 0.828
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Table 8.  Average Recall for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.947 0.860 0.909
    Chunk 2 (maj=659; min=326) 0.953 0.906 0.946
    Chunk 3 (maj=891; min=462) 0.946 0.875 0.925
    Chunk 4 (maj=1591; min=762) 0.921 0.872 0.888
    Chunk 5 (maj=2091; min=1030) 0.901 0.866 0.884
    Chunk 6 (maj=2214; min=1062) 0.813 0.828 0.820
    Chunk 7 (maj=2439; min=1188) 0.814 0.830 0.824
    Chunk 8 (maj=2476; min=1208) 0.793 0.826 0.808
    Chunk 9 (maj=6017; min=1438) 0.815 0.844 0.828
    Chunk 10 (maj=6128; min=1536) 0.806 0.841 0.822
    Chunk 11 (maj=6395; min=1704) 0.821 0.851 0.833
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Table 9.  Average F-measure for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.838 0.812 0.812
    Chunk 2 (maj=659; min=326) 0.900 0.890 0.898
    Chunk 3 (maj=891; min=462) 0.896 0.867 0.874
    Chunk 4 (maj=1591; min=762) 0.873 0.857 0.838
    Chunk 5 (maj=2091; min=1030) 0.860 0.849 0.838
    Chunk 6 (maj=2214; min=1062) 0.785 0.803 0.791
    Chunk 7 (maj=2439; min=1188) 0.794 0.808 0.804
    Chunk 8 (maj=2476; min=1208) 0.774 0.810 0.790
    Chunk 9 (maj=6017; min=1438) 0.798 0.827 0.813
    Chunk 10 (maj=6128; min=1536) 0.790 0.815 0.807
    Chunk 11 (maj=6395; min=1704) 0.807 0.828 0.821
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS

    Table 10.  Average FN Rate for IUSDS versus C4.5 during the 11 time stamps after each status change for chunk-by-chunk learning.
    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.053 0.140 0.091
    Chunk 2 (maj=659; min=326) 0.047 0.094 0.054
    Chunk 3 (maj=891; min=462) 0.054 0.125 0.075
    Chunk 4 (maj=1591; min=762) 0.079 0.128 0.112
    Chunk 5 (maj=2091; min=1030) 0.099 0.134 0.116
    Chunk 6 (maj=2214; min=1062) 0.187 0.172 0.180
    Chunk 7 (maj=2439; min=1188) 0.186 0.170 0.176
    Chunk 8 (maj=2476; min=1208) 0.207 0.174 0.192
    Chunk 9 (maj=6017; min=1438) 0.185 0.156 0.172
    Chunk 10 (maj=6128; min=1536) 0.194 0.159 0.178
    Chunk 11 (maj=6395; min=1704) 0.179 0.149 0.167
    Bold dot indicates the win of IUSDS; Empty dot indicates the loss of IUSDS


    The results of the IUSDS algorithm are efficient compared with C4.5 and Hoeffding Tree over all 15 incremental data stream chunks. The summary of wins, ties and losses is given in Table 11. The proposed IUSDS approach is effective in addressing the real-time issues of knowledge discovery from imbalanced and noisy data streams. The experimental results clearly indicate that the IUSDS approach performs better on data streams than the classical approaches.

    Table 11.  Summary of experimental results for IUSDS.
    Results Systems Wins Ties Losses
    TN Rate IUSDS v/s C4.5 11 0 0
    IUSDS v/s HoeffdingTree 11 0 0
    Accuracy IUSDS v/s C4.5 04 1 6
    IUSDS v/s HoeffdingTree 05 0 6
    FP Rate IUSDS v/s C4.5 11 0 0
    IUSDS v/s HoeffdingTree 11 0 0
    AUC IUSDS v/s C4.5 11 0 0
    IUSDS v/s HoeffdingTree 1 0 10
    Precision IUSDS v/s C4.5 10 0 1
    IUSDS v/s HoeffdingTree 08 1 02
    Recall IUSDS v/s C4.5 05 03 03
    IUSDS v/s HoeffdingTree 11 0 0
    F-measure IUSDS v/s C4.5 09 01 01
    IUSDS v/s HoeffdingTree 10 00 01
    FN Rate IUSDS v/s C4.5 05 03 03
    IUSDS v/s HoeffdingTree 11 00 00


    In this paper, the Incremental Under Sampling for Data Streams (IUSDS) algorithm is proposed, using an efficient and unique under-sampling strategy. The IUSDS approach under-samples the majority subset, retaining the prominent and class-oriented instances, for better knowledge discovery from data streams. The experimental results suggest that the IUSDS approach efficiently discovers knowledge from imbalanced data streams. In future work, we intend to extend the solution to the problem of concept drift in imbalanced data streams.

    [1] Alcalá-Fdez J., Fernandez A., Luengo J., Derrac J., García S., Sánchez L., Herrera F. (2011) KEEL data-mining software tool: Data set repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing 17: 255-287.
    [2] A. Asuncion and D. J. Newman, UCI Repository of Machine Learning Database (School of Information and Computer Science), Irvine, CA: Univ. of California [Online], 2007. Available: http://www.ics.uci.edu/mlearn/MLRepository.html
    [3] Brown I., Mues C. (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications 39: 3446-3453. doi: 10.1016/j.eswa.2011.09.033
    [4] Cao P., Zhao D., Zaiane O. (2013) A PSO-based cost-sensitive neural network for imbalanced data classification. Trends and Applications in Knowledge Discovery and Data Mining 452-463. doi: 10.1007/978-3-642-40319-4_39
    [5] Y. Chen, Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets, 2008 UC San Diego Data Mining Contest, 2008.
    [6] Chen Y., Tang S., Zhou L., Wang C., Du J., Wang T., Pei S. (2018) Decentralized Clustering by Finding Loose and Distributed Density Cores. Inform. Sci. 433/434: 510-526. doi: 10.1016/j.ins.2016.08.009
    [7] Doucette J., Heywood M. I. (2008) Classification under imbalanced data sets: Active sub-sampling and AUC approximation. In: M. O'Neill et al. (Eds.): EuroGP 2008, LNCS 4971: 266-277.
    [8] Frey B. J., Dueck D. (2007) Clustering by passing messages between data points. Science 315: 972-976. doi: 10.1126/science.1136800
    [9] G. Hulten, L. Spencer and P. Domingos, Mining time-changing data streams, In: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, (2001), 97-106. doi: 10.1145/502512.502529

    [10] Jain A. K. (2008) Data clustering: 50 years beyond K-means. Lecture Notes in Computer Science 5211: 3-4. doi: 10.1007/978-3-540-87479-9_3
    [11] R. Kohavi, Scaling up the accuracy of Naive-Bayes classifiers: A decision-tree hybrid, In: Second International Conference on Knowledge Discovery and Data Mining, (1996), 202-207.
    [12] López V., Triguero I., Carmona C. J., García S., Herrera F. (2014) Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 126: 15-28. doi: 10.1016/j.neucom.2013.01.050
    [13] Lorena A. C., Jacintho L. F. O., Siqueira M. F., De Giovanni R., Lohmann L. G., de Carvalho A. C. P. L. F., Yamamoto M. (2011) Comparing machine learning classifiers in potential distribution modelling. Expert Systems with Applications 38: 5268-5275. doi: 10.1016/j.eswa.2010.10.031
    [14] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, PhD Thesis, 1998.
    [15] A. K. Menon, H. Narasimhan, S. Agarwal and S. Chawla, On the statistical consistency of algorithms for binary classification under class imbalance, In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013.
    [16] Rodriguez A., Laio A. (2014) Clustering by fast search and find of density peaks. Science 344: 1492-1496. doi: 10.1126/science.1242072
    [17] Verbiest N., Ramentol E., Cornelis C., Herrera F. (2014) Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection. Applied Soft Computing 22: 511-517. doi: 10.1016/j.asoc.2014.05.023
    [18] Wang S., Minku L. L., Yao X. (2015) Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering 27: 1356-1368. doi: 10.1109/TKDE.2014.2345380
    [19] Witten I. H., Frank E. (2002) Data mining: Practical machine learning tools and techniques. Newsletter: ACM SIGMOD Record Homepage Archive 31: 76-77. doi: 10.1145/507338.507355
    [20] B. Yang and L. Jing, A novel nonparallel plane proximal SVM for imbalanced data classification, Journal of Software, 9 (2014).
  • © 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
