1. Introduction
Modern systems are capable of generating huge volumes of data continuously, which naturally form data streams. Data streams are by nature complex and difficult to analyse and mine efficiently for knowledge. When the classes in a data stream are imbalanced, the knowledge discovery process becomes an extraordinarily hard task, and the existing knowledge discovery approaches for data stream mining become a performance bottleneck. In recent years, considerable research effort has been expended on identifying and solving these problems, leading to the development of many effective data stream learners.
However, the main focus of this work is learning from data streams where the class balance is heavily skewed, or where it is unrealistic to expect more than a small fraction of the stream to be balanced. In such circumstances a new set of challenges arises that undermines the effectiveness of existing approaches. The recently proposed clustering approaches of [10,6,16,8] also face the same challenges with skewed data distributions in data streams.
The research community has contributed effective algorithmic approaches for imbalanced data stream learning, yet definite, practical and well-thought-out approaches are still required to solve the problem of data stream learning in real-world applications. In this work, we extend the core of recent approaches for solving the problem of skewed data stream learning; a concrete understanding and active exploration of these areas can significantly advance the development of technology for real-world incremental learning scenarios. In this paper, the empirical results of the proposed Incremental Under Sampling for Data Streams (IUSDS) algorithm are presented with a solid theoretical background.
The rest of this paper is organized as follows: Section 2 presents recent work on data streams. Section 3 presents the main framework of the proposed IUSDS algorithm. Section 4 provides a detailed explanation of the datasets and the evaluation criteria used in the experiments. Section 5 presents the algorithms used for comparison and the experimental setup. Section 6 presents the detailed experimental results and discussion. Section 7 draws conclusions and points out future research.
2. Related work
In this section, we first review the major research on classification in data stream learning.
Iain Brown et al. [3] have compared several techniques that can be used in the analysis of imbalanced credit scoring data sets. They progressively increase the class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the techniques is adversely affected. Ana C. Lorena et al. [13] have investigated the use of different supervised machine learning techniques to model the potentially imbalanced class distribution of 35 plant species from Latin America. Victoria López et al. [12] proposed an evolutionary framework that uses Iterative Instance Adjustment for Imbalanced Domains. Their method iteratively learns the appropriate number of examples that represent the classes and their particular positioning. The learning process contains three key operations in its design: a customized initialization procedure, an evolutionary optimization of the positioning of the examples, and a selection of the most representative examples for each class. Nele Verbiest et al. [17] proposed an improved synthetic minority oversampling technique (SMOTE) for data with class noise. Their approach cleans the data before applying SMOTE, so that the quality of the generated instances is better, and cleans the data after applying SMOTE, so that instances, whether original or introduced by SMOTE, that fit badly in the new dataset are also removed.
Peng Cao et al. [4] proposed an effective wrapper approach that incorporates the evaluation measure directly into the objective function of a cost-sensitive neural network to improve classification performance, using Particle Swarm Optimization to simultaneously optimize the best combination of feature subset, intrinsic structure parameters and misclassification costs. Yetian Chen [5] has reported two classification tasks based on data from a scientific experiment. The first is a binary classification task that maximizes classification accuracy on an evenly-distributed test data set, given a fully labeled imbalanced training data set. The second is also a binary classification task, but maximizes the F1-score of classification on a test data set, given a partially labeled training set. Doucette et al. [7] have proposed a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class-balance-enforcing rule for fitness evaluation. Aditya Krishna Menon et al. [15] have studied consistency with respect to one such performance measure, namely the arithmetic mean of the true positive and true negative rates, and established that some practically popular approaches, such as applying an empirically determined threshold to a suitable class probability estimate or performing an empirically balanced form of risk minimization, are in fact consistent with respect to this measure under mild conditions on the underlying distribution.
Shuo Wang et al. [18] have defined class imbalance online and proposed two learning algorithms, OOB and UOB, that build an ensemble model overcoming class imbalance in real time through re-sampling and time-decayed metrics. In further work they improved the re-sampling strategy inside OOB and UOB and examined their performance in both static and dynamic data streams, finding that UOB is better at recognizing minority-class examples in static data streams, while OOB is more robust against dynamic changes in class imbalance status. They subsequently proposed a multi-objective ensemble method, MOSOB, that combines OOB and UOB. MOSOB finds the Pareto-optimal weights for OOB and UOB at each time step to maximize minority-class recall and majority-class recall simultaneously, and they concluded that MOSOB performs well in both static and dynamic data streams.
Bing Yang et al. [20] have paid close attention to the uniqueness of uneven data distribution in imbalanced classification problems. Without changing the original imbalanced training data, they indicated the advantages of proximal classifiers for imbalanced data classification. In order to improve classification accuracy, they proposed a new model named LSNPPC, based on the classical proximal SVM models, which finds two nonparallel planes for data classification. Geoff Hulten et al. [9] have proposed a new version of the multi-class Weighted Support Vector Machines (WSVM) method to perform automatic recognition of activities in a smart home environment. WSVM is capable of solving the class imbalance problem by improving the class accuracy of activity classification compared to other methods such as CRF, k-NN and SVM.
3. Proposed Incremental Under Sampling for Data Streams (IUSDS) algorithm
This section presents the detailed architecture of the proposed IUSDS approach, which consists of four major modules; its working principles are explained in the sub-sections below. In the initial phase of the proposed IUSDS, the dataset is split into two subsets, known as the majority and minority subsets. The majority subset contains the instances of the class that holds the larger share of the dataset, while the minority subset contains the instances of the class that is far smaller in comparison. Since the proposed approach is an under-sampling approach, we consider the majority subset for more elaborate analysis aimed at reducing its instances.
The instances in the majority subset are reduced by three techniques: the first eliminates the noise instances, the second finds the outliers, and the third finds the ranges of weak instances for removal.
The noisy and outlier instances can be easily identified by analyzing the intrinsic properties of the instances. The range of weak instances can be identified by first identifying the weak features in the majority subset. The correlation based feature selection [14] technique selects the important features by considering the inter-correlation between features as well as the correlation between each feature and the class.
The features with very low correlation are identified for elimination, and the ranges of instances belonging to these weak features are marked for removal from the majority subset. The number of features and instances eliminated by the correlation based feature selection technique varies from dataset to dataset, depending on the unique properties of each dataset.
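As an illustration of the weak-feature step, the sketch below flags features whose absolute correlation with the class falls below a threshold. This is a deliberately simplified stand-in for correlation based feature selection [14], which also weighs feature-feature inter-correlation; the `threshold` value and function names are assumptions, not part of the original algorithm.

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def weak_features(X, y, threshold=0.1):
    """Return indices of features whose |correlation| with the class labels
    is below the threshold -- the 'weak' features whose instance ranges
    are candidates for elimination from the majority subset."""
    n_features = len(X[0])
    weak = []
    for j in range(n_features):
        col = [row[j] for row in X]
        if abs(pearson(col, y)) < threshold:
            weak.append(j)
    return weak
```

Here feature-class correlation alone decides weakness; a fuller CFS-style merit score would also penalize features that are redundant with each other.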
The detailed procedure of IUSDS is given in the form of algorithm as follows.
In the concluding phase of the algorithm, the majority subset from which irrelevant instances have been removed is merged with the minority subset to form the strong dataset, which is then passed to the base algorithm for experimental simulation. In this work, random forest [11] is used as the base algorithm for experimental simulation and results generation.
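A minimal sketch of one IUSDS-style pass is given below, covering only the noise/outlier elimination on the majority subset and the final merge with the minority subset. The per-feature z-score test and its `z_thresh` parameter are assumptions made for illustration; the paper's own algorithm identifies noise and outliers from the intrinsic properties of the instances and additionally removes the ranges of weak instances.

```python
import statistics

def iusds_reduce(majority, minority, z_thresh=3.0):
    """One illustrative under-sampling pass: drop majority instances that
    look like outliers on any feature, then merge with the minority subset.

    `majority` and `minority` are lists of numeric feature vectors.
    """
    n_features = len(majority[0])
    keep = [True] * len(majority)
    # Flag outliers per feature via a z-score test (an assumed criterion).
    for j in range(n_features):
        col = [row[j] for row in majority]
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col)
        if sd == 0:
            continue  # constant feature: nothing to flag
        for i, v in enumerate(col):
            if abs(v - mu) / sd > z_thresh:
                keep[i] = False
    reduced = [row for i, row in enumerate(majority) if keep[i]]
    # Merge the reduced majority subset with the untouched minority subset
    # to form the "strong" dataset handed to the base learner.
    return reduced + minority
```

The resulting merged list would then be fed to the base classifier (random forest in the paper's setup).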
4. Data sets and evaluation criteria
In this study, 15 binary data sets are collected from the UCI Repository of Machine Learning Databases (School of Information and Computer Science, University of California, Irvine, CA) [1] and from KEEL [2], one of the largest machine learning repositories for the research community. The datasets span varied application domains and are used to form an active data stream with different imbalance ratios.
The individual arriving streams of data range from 57 instances for the labor dataset to 3772 instances for the sick dataset. The Imbalance Ratio (IR) of the individual arriving streams ranges from 1.08 for the mushroom dataset to 15.32 for the sick dataset. The different properties of the datasets, such as the number of majority instances and the number of minority instances, are given in Table 1.
Table 2 presents the detailed description of the data stream formed over time t. At the initial time t, chunk 1 of B arrives with a total of 286 instances and an imbalance ratio of 2.36. At time t+1 another chunk arrives; the total becomes 985 instances and the imbalance ratio changes to 2.02. In the same way, as each data stream chunk arrives, the total number of instances to be processed and the imbalance ratio vary. When all fifteen chunks have arrived, the total reaches 19851 instances with an imbalance ratio of 1.63.
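The arithmetic behind Table 2 can be reproduced by accumulating per-chunk class counts, with the cumulative IR computed as cumulative majority count over cumulative minority count. In the sketch below, the first two chunks are assumed to carry the standard UCI class counts of breast-cancer (201 majority, 85 minority) and breast-w (458 majority, 241 minority); treating these as the first two chunks is an assumption, but their totals reproduce the figures quoted above.

```python
def cumulative_imbalance(chunks):
    """Track total stream size and imbalance ratio as chunks accumulate.

    Each chunk is a (majority_count, minority_count) pair; IR is the
    cumulative majority count divided by the cumulative minority count.
    """
    maj = mino = 0
    history = []
    for cmaj, cmin in chunks:
        maj += cmaj
        mino += cmin
        history.append((maj + mino, round(maj / mino, 2)))
    return history

# Assumed first two chunks: breast-cancer (201/85) and breast-w (458/241).
print(cumulative_imbalance([(201, 85), (458, 241)]))
# -> [(286, 2.36), (985, 2.02)]
```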
The evaluation metrics used in this paper are detailed below. The percentage of instances correctly classified by a classifier is known as Accuracy. When only a single run of the classification algorithm is available, AUC can be computed simply as the mean of the true positive rate and the true negative rate:

AUC = (TP Rate + TN Rate) / 2

or

AUC = (1/2) * (TP / (TP + FN) + TN / (TN + FP))

The Precision measure is computed by

Precision = TP / (TP + FP)

The Recall measure is computed by

Recall = TP / (TP + FN)

The F-measure value is computed by

F-measure = (2 * Precision * Recall) / (Precision + Recall)

The True Positive Rate [9] measure is computed by

TP Rate = TP / (TP + FN)

The True Negative Rate measure is computed by

TN Rate = TN / (TN + FP)
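All of these measures can be computed together from a single binary confusion matrix, as in the following sketch (function and dictionary key names are illustrative):

```python
def stream_metrics(tp, fn, fp, tn):
    """Compute the paper's evaluation measures from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tp_rate = tp / (tp + fn)          # recall / sensitivity
    tn_rate = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = 2 * precision * recall / (precision + recall)
    auc = (tp_rate + tn_rate) / 2     # single-run AUC: mean of TPR and TNR
    return {"accuracy": accuracy, "auc": auc, "precision": precision,
            "recall": recall, "f_measure": f_measure,
            "tp_rate": tp_rate, "tn_rate": tn_rate}
```

Note that the single-run AUC here is exactly the mean of TP Rate and TN Rate defined above, not the area under a full ROC curve.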
5. Experimental results
In this section, we present the experimental results of the proposed approach against two benchmark comparative algorithms, C4.5 and Hoeffding Tree. The experimental results of the comparative algorithms and the proposed IUSDS are implemented and generated in the Weka environment [19]. Two of the best-known algorithms are chosen for comparison so as to expose the strengths and weaknesses of the proposed approach on the different evaluation criteria.
Table 3 presents the average Accuracy results for the proposed IUSDS algorithm and the compared algorithms. We compared two benchmark algorithms with our proposal; the first is the classical decision tree algorithm C4.5. The results of IUSDS are far better than those of C4.5: out of all 15 incremental data stream chunks, the proposed IUSDS algorithm won on all 15. The second algorithm compared with IUSDS is Hoeffding Tree, one of the recent data stream learning algorithms. Compared with Hoeffding Tree, the proposed IUSDS algorithm also won on all 15 incremental data stream chunks in terms of accuracy.
Table 4 presents the average AUC results for the proposed IUSDS algorithm and the compared algorithms. The results of IUSDS are efficient compared with C4.5: out of all 15 incremental data stream chunks, the proposed IUSDS algorithm won on all 15. Compared with Hoeffding Tree, the proposed IUSDS algorithm achieved 6 wins, 2 ties and 7 losses out of the 15 incremental data stream chunks. The trends of accuracy, TP rate and TN rate are shown in Figures 1 and 2, respectively. These trends also show that the proposed IUSDS algorithm performed well against the compared algorithms.
Tables 5, 6, 7, 8, 9 and 10 present the average precision, recall, F-measure, TP Rate, FP Rate and TN Rate results, respectively, for the proposed IUSDS algorithm and the compared algorithms.
The results of IUSDS are efficient compared with C4.5 and Hoeffding Tree on all 15 incremental data stream chunks. The summary of wins, ties and losses is given in Table 11. The proposed IUSDS approach is effective in solving the real-time issues of knowledge discovery from imbalanced and noisy data streams, and the experimental results clearly indicate that IUSDS performs better on data streams than the classical approaches.
6. Conclusion
In this paper, the Incremental Under Sampling for Data Streams (IUSDS) algorithm is proposed, using an efficient and unique under-sampling strategy. The IUSDS approach under-samples the majority subset while preserving the prominent, class-oriented instances, for better knowledge discovery from data streams. The experimental results suggest that the IUSDS approach efficiently discovers knowledge from imbalanced data streams. In future work, we intend to address the problem of concept drift in imbalanced data streams.