Research article

CDBC: A novel data enhancement method based on improved between-class learning for darknet detection


  • With the development of the Internet, people have paid more attention to privacy protection, and privacy protection technology is widely used. However, this development has also bred the darknet, which has become a tool that criminals can exploit, especially in the fields of economic crime and military intelligence. Darknet detection is therefore becoming increasingly important; however, darknet traffic is severely imbalanced, detection is difficult, and the accuracy of existing detection methods needs to be improved. To overcome these problems, we first propose a novel learning method, the Chebyshev distance based Between-class learning (CDBC), which can learn the spatial distribution of the darknet dataset and generate "gap data". The gap data can be adopted to optimize the distribution boundaries of the dataset. Second, a novel darknet traffic detection method is proposed. We test the proposed method on the ISCXTor 2016 dataset and the CIC-Darknet 2020 dataset, and the results show that CDBC helps more than 10 existing methods improve accuracy, in some cases up to 99.99%. Compared with other sampling methods, CDBC also helps the classifiers achieve higher recall.

    Citation: Binjie Song, Yufei Chang, Minxi Liao, Yuanhang Wang, Jixiang Chen, Nianwang Wang. CDBC: A novel data enhancement method based on improved between-class learning for darknet detection[J]. Mathematical Biosciences and Engineering, 2023, 20(8): 14959-14977. doi: 10.3934/mbe.2023670




    Functional data analysis (FDA) is a statistical technique used to analyze data consisting of functions or data produced by underlying functions. FDA aims to provide exploratory and inferential tools for analyzing curve and longitudinal data with minimum constraints on the parameters involved [1]. This toolkit includes techniques such as functional principal component analysis for dimension reduction [2,3], functional regression [4,5], and functional clustering and classification [6,7]. Owing to notable progress in methodology and software tools, FDA has become a firmly established subject in nonparametric statistics. Diverse fields have utilized FDA, such as image analysis [8], the study of the transmission of diseases like COVID-19 [9], and growth curve analysis [10].

    The application of FDA to climate variables has recently gained significant attention. A study by [11] proposed a functional time series approach for hourly air temperature forecasts, enabling ultra-short period predictions that traditional methods cannot achieve. Similarly, [12] introduced a new spatial functional data analysis approach to evaluate the performance of 18 CORDEX regional climate model (RCM) simulations for the European domain (EURO-CORDEX) in predicting average temperatures in Italy. This approach addresses the limitation of traditional climate model selection, which typically focuses only on average values across time, by considering the overall mean of the function rather than detailed temporal behavior. An innovative method of functional principal component analysis (fPCA) for incomplete space-time data has also been introduced in research by [13], allowing for the identification of main variability patterns in temperature data. According to [14], the initial stage of FDA involves smoothing [15,16] or interpolation [17], which consists of converting discrete data into a function. Interpolation is applied when discrete values are assumed to be without error, whereas smoothing transforms data into a functional form by removing any observational errors [14].

    While interpolation can be performed using the beta spline, it is inadequate when derivative information of the data is required [14]. One of the main objectives of FDA is to study important patterns and variations in the data, which requires accurate derivative information. Derivative information, or the rate of change, can only be extracted when the data are in the form of a function, which is possible only when the data are approximated. Beta spline interpolation is feasible, but it typically yields nearly straight segments that do not effectively represent the functional form of large datasets: the curve merely interpolates from one data point to the next, creating straight lines rather than capturing the data's complexity. Therefore, in the FDA framework, Fourier or B-spline bases are generally used for approximation rather than interpolation, as highlighted by [15,16,18,19]. Thus, the approximation method is preferred to represent and analyze the data adequately.

    Smoothing is a technique for identifying a sequence of numbers that accurately represents the trend in a given dataset. This technique is commonly used for time series data with variations or seasonality [20]. Smoothing data eliminates noise or random fluctuations to improve the clarity of patterns and trends. The next step is data visualization, which involves creating visual representations of data to gain insights and identify patterns and trends. The roughness penalty approach (RPA) and generalized cross-validation (GCV) are popular approaches for smoothing discrete data and determining the best parameters for converting it to a functional form. Other than RPA, recent studies have also proposed spline approximation methods for smoothing time series with extreme events. The study by [21] presented a spline Hermite quasi-interpolation method for filling in missing data and smoothing univariate time series; this model can be used for forecasting and detecting anomalies. An entropy-based weighting methodology for determining spline approximations of multivariate time series was introduced by [22]. The method was demonstrated to effectively mitigate the impact of outliers and noise, even when handling large and highly noisy datasets. The most commonly used bases are the B-spline and Fourier bases. The B-spline basis is widely used to fit nonperiodic data due to its characteristics that allow the curve to be more flexible and its efficient modeling of time-varying patterns [23]. In contrast, the Fourier basis is often chosen for its fast processing speed and ability to handle periodic data [24].

    Within the FDA framework for analyzing climatic data, Fourier basis functions were utilized in a study by [19] to smooth temperature data due to its periodic structure. In another investigation by [25], monthly maximum temperature variation was explored using B-spline basis functions. Gaussian basis functions were chosen in [26] for their ability to effectively smooth functional data and capture underlying patterns. This approach strikes a balance between overfitting and underfitting, thereby enhancing model performance. Conversely, in a study by [27], Fourier basis functions were initially employed for data smoothing, justified by the periodicity of the air temperature series. However, the use of B-splines yielded slightly better forecast results due to their ability to provide a good balance between model flexibility and overfitting in capturing complex patterns within the data, ensuring accurate forecasts [27].

    The choice of basis functions is crucial as it directly impacts the model's ability to represent the data accurately and to support reliable analyses, such as forecasting [27]. When representing a set of data points, utilizing a spline with greater flexibility is advisable, as it enhances the smoothing process and leads to improved forecasting performance. Many new basis functions have recently been developed to improve surface and curve flexibility, such as by Ammad et al. [28] and Said Mad Zain et al. [29]. These new basis functions provide additional shape parameters that flexibly alter the shape of the curve while retaining the existing control points, making shape changes convenient to handle. In 1981, Barsky [30] developed the beta spline, a flexible extension of the B-spline. The beta spline has distinct advantages over other splines, as it achieves G2 continuity while also being characterized by two additional parameters that influence the curve's shape. Its curve and surface shape can be adjusted without altering control points [31]. Beta splines generally create flexible shapes and have a smoother appearance than those generated by Bézier curves [32].

    Due to its ability to model curves flexibly, the beta spline is a valuable subject of study in image processing and machine vision [33]. The beta spline curve provides enhanced capabilities for producing 2D graphics, such as digital khat calligraphy, by offering smoother shapes, control over vertex movement, and by ensuring continuity and flexibility [34]. According to [35], beta spline interpolation is a technique that provides the most accurate and smooth curve fitting by selecting the curve that is closest to the data points. Beta spline surfaces can also be generated by parallel computation, enabling faster processing and the handling of large datasets. The parallel beta spline method in surface fitting yields efficient and precise outcomes. According to [36], incorporating the parallel method does not alter the surface structure, guaranteeing the integrity of the reconstructed surface.

    This research seeks to integrate cubic beta spline in smoothing climate data observations into a functional form, aiming to develop a new and flexible technique for data smoothing in functional data analysis. In the initial phase, this research integrates spline smoothing using cubic beta spline and RPA to transform the discrete data into a curve. Next, the shape parameters of the spline will be optimized by applying a method known as GCV. This optimization method is applied to help determine the optimal combination of shape parameters to obtain the best-fitted curve. Finally, the rainbow plot is used to represent the best-fit temperature curve of each meteorology station in north Peninsular Malaysia.

    This paper is structured as follows: In Section 2, the cubic beta spline basis is defined, the curve is constructed, and the effects of manipulating the shape parameters are explained. Next, the smoothing approach is presented in Section 3, together with the calculation for finding the smoothing parameter value, λ. The formula for selecting shape parameters, GCV, is described in Section 4. Section 5 presents the smoothing results by implementing the shape parameters value for optimal, overfitted, and underfitted shape parameters. Finally, Section 6 provides the conclusion and a few possible directions for further research.

    This study focused on applying beta spline in the FDA framework, which was developed by Barsky during his doctoral research [30]. The development of the basis was also extensively studied in [37,38]. The expression for the beta spline curve of degree 3 is given by Eq (2.1),

    $F(t) = [T]\,[M]\,[V]^{T} \quad (2.1)$

    where $[T] = (t^3 \;\; t^2 \;\; t \;\; 1)$ is the polynomial row vector, $[V] = (V_1 \;\; V_2 \;\; V_3 \;\; V_4)$ is the control point vector, and $[M]$ is the beta spline basis function matrix given in the following Eq (2.2):

    $[M] = \dfrac{1}{\delta}\begin{pmatrix} -2\beta_1^3 & 2(\beta_2+\beta_1^3+\beta_1^2+\beta_1) & -2(\beta_2+\beta_1^2+\beta_1+1) & 2 \\ 6\beta_1^3 & -3(\beta_2+2\beta_1^3+2\beta_1^2) & 3(\beta_2+2\beta_1^2) & 0 \\ -6\beta_1^3 & 6(\beta_1^3-\beta_1) & 6\beta_1 & 0 \\ 2\beta_1^3 & \beta_2+4(\beta_1^2+\beta_1) & 2 & 0 \end{pmatrix} \quad (2.2)$

    where $\delta = \beta_2 + 2\beta_1^3 + 4\beta_1^2 + 4\beta_1 + 2$, with β1 (bias) and β2 (tension) being the shape parameters of the spline.
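As a concrete check of Eq (2.2), the sketch below (in NumPy; the helper names `beta_matrix` and `beta_point` are ours, not from the paper) builds [M] for given shape parameters and evaluates one curve segment. With β1 = 1 and β2 = 0, the matrix should reduce to the familiar uniform cubic B-spline basis matrix, and the blending weights should sum to one at every t.

```python
import numpy as np

def beta_matrix(b1, b2):
    # Cubic beta-spline basis matrix [M] of Eq (2.2), including the 1/delta factor.
    d = b2 + 2*b1**3 + 4*b1**2 + 4*b1 + 2       # delta
    return np.array([
        [-2*b1**3, 2*(b2 + b1**3 + b1**2 + b1), -2*(b2 + b1**2 + b1 + 1), 2],
        [ 6*b1**3, -3*(b2 + 2*b1**3 + 2*b1**2),  3*(b2 + 2*b1**2),        0],
        [-6*b1**3,  6*(b1**3 - b1),              6*b1,                    0],
        [ 2*b1**3,  b2 + 4*(b1**2 + b1),         2,                       0]], float) / d

def beta_point(t, V, b1=1.0, b2=0.0):
    # One curve segment F(t) = [T][M][V], Eq (2.1), for t in [0, 1].
    return float(np.array([t**3, t**2, t, 1.0]) @ beta_matrix(b1, b2) @ np.asarray(V, float))

# With beta1 = 1 and beta2 = 0, [M] reduces to the uniform cubic B-spline matrix.
bspline = np.array([[-1, 3, -3, 1], [3, -6, 3, 0], [-3, 0, 3, 0], [1, 4, 1, 0]]) / 6
print(np.allclose(beta_matrix(1.0, 0.0), bspline))          # True
# The blending weights sum to one at every t (affine invariance):
print(beta_point(0.37, [1.0, 1.0, 1.0, 1.0], 2.0, 1.5))     # ~1.0
```

The reduction at β1 = 1, β2 = 0 is the same correspondence with the B-spline that is noted in the discussion of Figures 1 and 2 below.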

    By implementing these shape parameters, one can manipulate the shape without changing the control points while simultaneously introducing G2 continuity to ensure that the fitted curve maintains accuracy and smoothness. The beta spline curve also satisfies the properties of locality, convex hull, and end conditions. However, end conditions for beta splines are often handled differently than those for other splines, such as B-splines, which will be further discussed in the following subsection. When constructing the beta spline curve, determining an adequate number of data points is not a concern. Each curve segment of the beta spline can be continuously connected between data points without needing intermediate points. In contrast, methods like B-splines or Fourier may encounter issues with having either too many or too few data points. For instance, B-splines and Fourier require an exact number of data points based on the chosen degree of basis functions, while Bézier splines depend on the number of control points. Beta splines, however, rely more on the knots. Consequently, an excessive or insufficient number of data points does not pose an issue when using beta splines, unlike other spline methods.

    Figure 1 shows a beta spline basis and its curve with β1=1 and β2=0, which has a shape similar to a B-spline without repeated knots. The comparison curves representing B-splines, both with and without repeated knots, and beta splines are depicted in Figure 2. Notably, the curvature of the B-spline with repeated knots differs from the others due to variations in the underlying basis values. When repeated knots are absent, the B-spline curve resembles the beta spline curve, particularly when β1=1 and β2=0. However, for different values of β1 and β2, the beta spline basis does not maintain C2 continuity at knots and instead achieves G2 continuity [39]. Examples illustrating this behavior are presented in the next subsection.

    Figure 1.  Beta spline with β1=1 and β2=0.
    Figure 2.  B-spline and beta spline (β1=1 and β2=0) curves.

    Although periodicity or seasonal variation is one of the salient features of environmental and climate data, such as precipitation and temperature, splines that provide flexibility, like the beta spline or B-spline, are preferred in this study. In a study by [40], the B-spline basis was applied, offering a diverse range of functions to capture the variability in simulated climatic time series. Similarly, [41] employed cubic B-splines as basis functions for weather time series data. B-splines are commonly chosen for their ability to yield higher-order derivative functions, facilitating the analysis of the rate of change in weather variables over time and allowing for a comprehensive analysis of weather variations.

    The shape of the beta curve for the first segment, from day 1 until day 4, is shown in Figure 3, illustrating how its parameters vary. Among the curves depicted, the blue one displays the least error compared to the red and black curves. This indicates that the blue curve closely follows the control points of the data. In contrast, the red and black curves have very similar shapes. However, the black curve has a higher error value and is positioned farther from the third control point. Therefore, the red curve, characterized by β1=2 and β2=1, emerges as the most suitable representation of the control points in the first segment.

    Figure 3.  Beta spline curve with different shape parameters at the first curve.

    In Figure 3, sharp curves are observed at the endpoints of each segment, specifically from day 1 to day 4, day 4 to day 7, and day 7 to day 10. These sharp turns exhibit discontinuities at each segment's initial and final points. This discontinuity results from the constant data points at the segment endpoints. While repeated data points at these endpoints are necessary to ensure the beta spline curves reach them, this requirement introduces discontinuities at the segment boundaries. Additionally, another example demonstrates how distinct curves with different shape parameters converge at the same control point. Unlike the previous example, where the entire curve was segmented into four parts, leading to discontinuities at every endpoint, this approach shows that segments with different shape parameters can also achieve good continuity solutions. This is validated in Figure 4.

    Figure 4.  Beta spline curve with different shape parameters.

    The first shape parameter of the beta spline is β1, also known as bias. When β1 is increased, the "velocity" at which one traverses the curve (from left to right, for example) just to the right of a joint is greater than the "velocity" just to the left of the joint [37]. This introduces a bias into the curve: for values exceeding one, the unit tangent vector at the joint (which is continuous) exerts a stronger influence toward the right than toward the left [37].

    Examples can be seen in Figures 5 and 6, where each plot is computed for a distinct value of β1, which determines the relative magnitude of the slopes to the left and right of each joint. In Figure 5, the basis is biased to the right as β1 increases. The curve will also extend further in the direction of the tangent in the rightmost segment. The effect of increasing β1 can be seen in Figure 6, where the pink and cyan lines keep expanding to the left as the value of the bias parameter increases.

    Figure 5.  Effect of increasing β1.
    Figure 6.  Effect of increasing β1 on curve.

    On the other hand, the second parameter β2, known as tension, controls the tension inside the curve. By changing the value of β2, the joint between two segments is moved along a vector that passes through the control vertex. This action is performed simultaneously for all the joints that make up the uniformly shaped curve. As shown in Figure 7, it is apparent that as β2 increases, the basis function's peak approaches the value one, and its "tails," which are located in the support's leftmost and rightmost intervals, approach zero. To illustrate, as the parameter β2 increases, each joint is displaced toward its corresponding control vertex, causing the curve to flatten toward the control polygon, as shown in Figure 8.

    Figure 7.  Effect of increasing β2.
    Figure 8.  Effect of increasing β2 on curve.
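The tension effect can be verified numerically from Eq (2.2): the start of a segment, F(0), is a convex combination of the first three control points, and its weight on the control vertex V2 grows with β2. A minimal sketch (the helper name `segment_start` is ours):

```python
import numpy as np

def segment_start(V, b1, b2):
    # F(0) of a cubic beta-spline segment: the t = 0 row of [M] in Eq (2.2)
    # dotted with the four control points.
    d = b2 + 2*b1**3 + 4*b1**2 + 4*b1 + 2       # delta
    w = np.array([2*b1**3, b2 + 4*(b1**2 + b1), 2.0, 0.0]) / d
    return float(w @ np.asarray(V, float))

V = [0.0, 1.0, 0.0, 0.0]      # isolate the influence of the control vertex V2
for b2 in (0.0, 5.0, 50.0, 500.0):
    print(b2, segment_start(V, 1.0, b2))
```

As β2 grows, the printed values increase monotonically toward 1, i.e., the joint is pulled onto the control vertex; this is exactly the flattening toward the control polygon shown in Figure 8.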

    As presented by Barsky, the range of the shape parameters encompasses all real numbers. If β1 is greater than 0 and β2 is greater than or equal to 0, the functions constitute a basis: they are linearly independent, and every segment of a beta spline curve can be described as a linear combination of them. Furthermore, the coefficients used to combine these basis functions are unique, since the basis functions do not depend on each other. Thus, each segment of a beta spline curve with β1>0 and β2≥0 may be expressed uniquely as a linear combination of these basis functions, with the combination coefficients corresponding to the control vertices associated with the curve.

    Negative values of β can also be used; however, in the context of this study, particularly for data interpolation, negative β values are not feasible. It was found that when negative values of β are applied, the basis does not satisfy positivity, as reflected at x=0. Furthermore, the curve exhibits an unsuitable, irregular shape and fails to depict the data trend accurately. The curve generated from negative values is considered suitable for various geometrical contexts but unsuitable for data smoothing purposes.

    The beta spline curve typically does not start at a control vertex or at any point along the line segment between control points V0 and V1; its starting point lies inside the convex hull formed by V0, V1, and V2. The endpoints are therefore often handled separately to exert more precise control over them. To make the beta spline curve touch the endpoints, the control point vectors of the first two curve segments are defined as $[V_1] = (v_0 \;\; v_0 \;\; v_0 \;\; v_1)$ and $[V_2] = (v_0 \;\; v_0 \;\; v_1 \;\; v_2)$. The same process defines the last two segments, $[V_4] = (v_1 \;\; v_2 \;\; v_3 \;\; v_3)$ and $[V_5] = (v_2 \;\; v_3 \;\; v_3 \;\; v_3)$. These definitions fit five beta spline curve segments to the same control polygon so that the curve touches both endpoints $v_0$ and $v_3$. The effect of applying and not applying these end conditions is shown in Figure 9: in the left figure, the curve does not connect the endpoints of the control polygon, whereas fulfilling the condition above makes the beta spline curve touch the endpoints, as in the right figure.
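A small numerical check of these end conditions, assuming the basis matrix of Eq (2.2) (the helper names are ours): tripling the first control point pins the start of the curve to v0 for any admissible shape parameters, since the t = 0 blending weights of [M] sum to δ/δ = 1.

```python
import numpy as np

def beta_matrix(b1, b2):
    # Cubic beta-spline basis matrix of Eq (2.2)
    d = b2 + 2*b1**3 + 4*b1**2 + 4*b1 + 2
    return np.array([
        [-2*b1**3, 2*(b2 + b1**3 + b1**2 + b1), -2*(b2 + b1**2 + b1 + 1), 2],
        [ 6*b1**3, -3*(b2 + 2*b1**3 + 2*b1**2),  3*(b2 + 2*b1**2),        0],
        [-6*b1**3,  6*(b1**3 - b1),              6*b1,                    0],
        [ 2*b1**3,  b2 + 4*(b1**2 + b1),         2,                       0]], float) / d

def F(t, V, b1, b2):
    # One curve segment, Eq (2.1)
    return float(np.array([t**3, t**2, t, 1.0]) @ beta_matrix(b1, b2) @ np.asarray(V, float))

v0, v1 = 3.0, 7.0
# Without repetition the segment starts strictly inside the convex hull of the
# first three control points, not at v0:
print(F(0.0, [v0, v1, 9.0, 2.0], 2.0, 1.0))
# Repeating v0 three times, as in [V1] = (v0 v0 v0 v1), pins the start to v0
# for any admissible shape parameters:
print(F(0.0, [v0, v0, v0, v1], 2.0, 1.0))   # -> 3.0
```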

    Figure 9.  Comparison of beta curve with end conditions.

    According to [42], combining the roughness penalty method with the GCV criterion is one strategy for selecting the optimal smoothing parameter. The model's performance can be evaluated using the GCV criterion by comparing the actual and smoothed data. The RPA is one way to control the smoothness of the model by incorporating a penalty term into the objective function. According to [14], this roughness penalty term, R, should be determined and calculated first. Let ϕ be the K-vector of basis functions, where K is the total number of basis functions; the roughness penalty term R can be calculated using Eq (3.1).

    $R = \int D^2\phi(t)\, D^2\phi'(t)\, dt. \quad (3.1)$

    The roughness penalty matrix R defined in Eq (3.1) is composed of the integrals of the outer products of the second derivatives $D^2\phi$ of the basis functions. The notation $\phi'$ denotes the transpose of the vector $\phi$.
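As a toy illustration of Eq (3.1) (not the paper's beta spline basis), the penalty matrix can be approximated by numerical integration of the outer products of second derivatives. For the monomial basis {1, t, t², t³} on [0, 1], the nonzero entries have closed forms, which makes the computation easy to check.

```python
import numpy as np

def d2_phi(t):
    # Second derivatives D^2 phi of the toy monomial basis {1, t, t^2, t^3}
    return np.array([0.0, 0.0, 2.0, 6.0*t])

ts = np.linspace(0.0, 1.0, 2001)
mid = (ts[:-1] + ts[1:]) / 2                  # midpoint-rule nodes
dt = ts[1] - ts[0]
# Eq (3.1): R = integral over t of the outer products D^2 phi(t) D^2 phi'(t)
R = sum(np.outer(d2_phi(t), d2_phi(t)) for t in mid) * dt

print(np.round(R, 4))
# Exact values: R[2,2] = 4, R[2,3] = R[3,2] = 6, R[3,3] = 12; all other entries 0.
```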

    Next, data smoothing was carried out before obtaining the smoothing parameter value λ that will be used in the GCV process. Let $y_j$, $j = 1, \ldots, n$, be the discrete observations, modeled as $y_j = x(t_j) + \varepsilon_j$, with the basis function expansion of $x(t)$ in the form $x(t) = \sum_{k=1}^{K} c_k \phi_k = \Phi c$, where $\Phi$ is the basis function matrix and $c$ is the K-vector of coefficients, which can be obtained from Eq (3.2).

    $\hat{c} = (\Phi' W \Phi + \lambda R)^{-1} \Phi' W y. \quad (3.2)$

    In Eq (3.2), λ is the smoothing parameter, $\Phi$ is the $n \times K$ matrix containing the values $\phi_k(t_j)$, and $W$ is a symmetric positive definite matrix that allows for unequal weighting of the squares and products of residuals; it is taken to be the identity matrix $I$ when the standard model is assumed. The data-fitting vector $\hat{y}$ is then as follows,

    $\hat{y} = \Phi(\Phi' W \Phi + \lambda R)^{-1} \Phi' W y. \quad (3.3)$

    When fitting data with a roughness penalty approach, the smoothing parameter λ is employed to regulate the smoothness of the curve. This parameter λ mediates the trade-off between achieving an accurate fit to the data and preserving the smoothness of the function x. To determine the appropriate value of λ, the equation representing the fitted curve using RPA, denoted as $\hat{y}$ in Eq (3.3), is equated with the beta spline curve from Eq (2.1). This beta spline curve is also expressed as a linear combination of the beta spline basis and the control points, so the equation can also be written as $F(t) = \Phi y = \hat{y}$. The calculation of the smoothing parameter λ is detailed in Eq (3.4), where its value varies with different values of the shape parameters β1 and β2. Let $\hat{y} = F(t)$ and $W = I$, where $I$ is the identity matrix,

    $\Phi y = \Phi(\Phi'\Phi + \lambda R)^{-1} \Phi' y$
    $1 = \Phi(\Phi'\Phi + \lambda R)^{-1}$
    $\Phi'\Phi + \lambda R = \Phi$
    $\lambda R = \Phi - \Phi'\Phi$
    $\lambda = (\Phi - \Phi'\Phi) R^{-1}. \quad (3.4)$
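The penalized fit of Eqs (3.2) and (3.3) with W = I can be sketched on synthetic data; here a small monomial basis stands in for the paper's spline basis, and the penalty integrals of Eq (3.1) are evaluated in closed form. Increasing λ should trade goodness of fit (larger SSE) for smoothness (smaller roughness c′Rc).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 40)
y = np.sin(2*np.pi*t) + 0.1*rng.standard_normal(t.size)

K = 6
Phi = np.vander(t, K, increasing=True)        # toy basis: Phi[j, k] = t_j**k

# Closed-form penalty integrals R[i, k] = int_0^1 D^2(t^i) D^2(t^k) dt for this basis.
R = np.zeros((K, K))
for i in range(2, K):
    for k in range(2, K):
        R[i, k] = i*(i - 1)*k*(k - 1) / (i + k - 3)

def fit(lam):
    # Eq (3.2) with W = I: c_hat = (Phi' Phi + lambda R)^(-1) Phi' y,
    # then Eq (3.3): y_hat = Phi c_hat.
    c = np.linalg.solve(Phi.T @ Phi + lam*R, Phi.T @ y)
    return c, Phi @ c

for lam in (0.0, 1e-3, 1.0):
    c, yhat = fit(lam)
    print(f"lambda={lam:g}  SSE={np.sum((y - yhat)**2):.4f}  roughness={c @ R @ c:.4f}")
```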

    To regulate the model's smoothness and minimize the difference between observed and predicted data, the GCV criterion is used to choose the optimal smoothing parameter. Craven and Wahba developed this algorithm in 1979 [43]. It was initially designed as a simplified alternative to cross-validation, eliminating the need for n iterations of smoothing. However, it has demonstrated greater accuracy than cross-validation due to its reduced propensity for under-smoothing, and it is a popular approach for spline smoothing [14]. This study employed GCV to find the best combination of beta spline curve parameters by optimizing the shape parameters, aiming to achieve a smooth and accurately fitting curve. The equation of GCV is given as follows,

    $\mathrm{GCV} = \left(\dfrac{n}{n - \mathrm{df}(\lambda)}\right)\left(\dfrac{\mathrm{SSE}}{n - \mathrm{df}(\lambda)}\right). \quad (4.1)$

    In Eq (4.1), $\mathrm{SSE} = \sum_{j=1}^{n} (y_j - x(t_j))^2$ and $\mathrm{df}(\lambda) = \mathrm{trace}[H(\lambda)]$ with $H(\lambda) = \Phi(\Phi'\Phi + \lambda R)^{-1}\Phi'$. Figure 10 depicts the flowchart of the proposed method to further aid understanding. Algorithm 1 outlines the steps required to transform discrete temperature data points into a smooth, continuous curve by applying cubic beta splines, ensuring an optimal representation of the underlying temperature trends.
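Eq (4.1) can be evaluated over a grid of λ values with df(λ) = trace H(λ); below is a sketch with the same kind of toy monomial basis (an illustration, not the paper's beta spline setup). At λ = 0 the hat matrix is a projection, so df equals the number of basis functions K, and df shrinks as λ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 6
t = np.linspace(0.0, 1.0, n)
y = np.sin(2*np.pi*t) + 0.1*rng.standard_normal(n)

Phi = np.vander(t, K, increasing=True)        # toy basis: Phi[j, k] = t_j**k

# Closed-form penalty integrals for the monomial basis on [0, 1] (stand-in for Eq (3.1)).
R = np.zeros((K, K))
for i in range(2, K):
    for k in range(2, K):
        R[i, k] = i*(i - 1)*k*(k - 1) / (i + k - 3)

def gcv(lam):
    # Eq (4.1): GCV = (n / (n - df)) * (SSE / (n - df)), with df = trace H(lambda)
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam*R, Phi.T)
    df = np.trace(H)
    sse = float(np.sum((y - H @ y)**2))
    return (n / (n - df)) * (sse / (n - df)), df

for lam in (0.0, 1e-4, 1e-2, 1.0):
    g, df = gcv(lam)
    print(f"lambda={lam:g}  df={df:.3f}  GCV={g:.5f}")
```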

    Figure 10.  Flowchart of the proposed data smoothing technique.

    Algorithm 1 Cubic beta curve smoothing process
    (ⅰ) Input data points in vector [V].
    (ⅱ) Compute the values of beta spline basis matrix [M] as discussed in Section 2.
    (ⅲ) Calculate the value of the roughness penalty term R as defined in Eq (3.1).
    (ⅳ) Evaluate the value of the smoothing parameter λ as formulated in Eq (3.4).
    (ⅴ) Compute the GCV error as in Eq (4.1) for each combination of β1 and β2.
    (ⅵ) Construct the cubic beta curve, F(t), using the optimal shape parameter values obtained from GCV.
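Step (ⅴ) of Algorithm 1 amounts to a grid search over (β1, β2) combinations. The sketch below illustrates only the mechanics of that search, with a deliberately simplified score (the squared distance between the segment joints and the data, using the data points as the control polygon) standing in for the full GCV criterion of Eq (4.1); the helper names are ours.

```python
import numpy as np

def beta_matrix(b1, b2):
    # Cubic beta-spline basis matrix of Eq (2.2)
    d = b2 + 2*b1**3 + 4*b1**2 + 4*b1 + 2
    return np.array([
        [-2*b1**3, 2*(b2 + b1**3 + b1**2 + b1), -2*(b2 + b1**2 + b1 + 1), 2],
        [ 6*b1**3, -3*(b2 + 2*b1**3 + 2*b1**2),  3*(b2 + 2*b1**2),        0],
        [-6*b1**3,  6*(b1**3 - b1),              6*b1,                    0],
        [ 2*b1**3,  b2 + 4*(b1**2 + b1),         2,                       0]], float) / d

def joint_sse(y, b1, b2):
    # Squared distance between the data and the segment joints F(0) of a
    # uniform cubic beta spline whose control polygon is the data itself.
    w = beta_matrix(b1, b2)[3]                 # t = 0 row: joint blending weights
    joints = np.array([w @ y[i:i+4] for i in range(len(y) - 3)])
    return float(np.sum((joints - y[1:len(y) - 2])**2))

rng = np.random.default_rng(2)
y = np.sin(np.linspace(0.0, 2*np.pi, 31)) + 0.1*rng.standard_normal(31)

# Step (v) of Algorithm 1: score each (beta1, beta2) combination, keep the best.
grid = [(b1, b2) for b1 in (0.5, 1.0, 2.0, 4.0, 8.0) for b2 in (0.0, 1.0, 5.0, 20.0)]
best = min(grid, key=lambda p: joint_sse(y, *p))
print("selected (beta1, beta2):", best)
```

Note that a pure error score always favors the tautest curve; the GCV criterion of Eq (4.1) additionally penalizes the effective degrees of freedom, which is what lets the paper's search reject the overfitted extremes.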

    The proposed approach was applied to the temperature data from meteorological stations in northern Peninsular Malaysia for January 2022. This study faced the challenge of determining the optimal combination of parameters, as changes in β1 and β2 can modify the curve's shape independently of the control vertices. Thus, GCV was utilized to determine the optimal values of the shape parameters. The range was selected as β1 > 0 and β2 ≥ 0 to ensure linear independence. The ranges were adjusted to study their impact on the optimized shape parameter values. The method's approximation errors are compared with those obtained using the standard shape parameter values, β1=1 and β2=0. Before optimization, underfitting and overfitting can be assessed by evaluating the smoothness of the curve. An over-smoothed curve that fails to capture the underlying pattern of the data points indicates underfitting. Conversely, an under-smoothed curve that lacks smoothness and captures unnecessary data details indicates overfitting. It is crucial to plot the best-fitted curve with low error and optimal smoothness during the transformation process, as overfitting and underfitting curves impact subsequent FDA procedures differently. Overfitting introduces irrelevant details, complicates result interpretation, and increases computational costs, while underfitting removes useful information that cannot be recovered later [44].

    For GCV, two parameter ranges are crossed, and the procedure consists of exploring several combinations of the crossed parameter values, computing the GCV error for each of them, and choosing the combination that yields low error with optimal smoothness. Based on the GCV error values, a high error indicates that the curve is over-smoothed and underfits the data. Conversely, a very low error suggests that the curve is under-smoothed and overfits the data. Therefore, in this study, the optimal curve is determined to lie between these extremes of overfitting and underfitting. The guidelines followed by this study on the effect of choosing different smoothing parameters have also recently been practiced by [45] and were well demonstrated by [44].

    The result of this approach is presented in Figure 11, with the darker pink shade representing the lowest error and the darker blue shade representing the highest error. The color grid displays the GCV values for each combination of shape parameters. As shown in Figure 11, the minimum GCV error in the grid corresponds to β1=8 and β2=0, represented by the darkest shade of pink. Figure 12 illustrates the curve obtained from Eq (2.1) using these values. The curve is not smooth, exhibits sharp edges, and closely follows the control polygon, capturing all data details. Meanwhile, the highest GCV error is produced by the combination of β1=1 and β2=0. Figure 13 shows that the curve with these values is over-smoothed; because it does not cover all the points, it may overlook some significant aspects of the data. Since the objective is not merely to create a smooth curve but a curve that best fits the data, it is unreasonable to represent the data using either of these curves.

    Figure 11.  Color grid showing the GCV error for several (β1, β2) combinations.
    Figure 12.  Temperature curve with the lowest GCV error.
    Figure 13.  Temperature curve with the highest GCV error.

    In Figure 14, the curves with the shape parameters that yield the highest and lowest errors are compared. The graph shows that the blue curve misses almost all of the extreme minimum and maximum temperature data points, such as on days 3, 7, 11, 14, 18, 25, 26, and 30. Meanwhile, the green curve, which produces the lowest error, almost touches all the extreme points, capturing every data detail. The curve shape can be adjusted on the segments with extreme data points by altering the value of the shape parameters at specific intervals or curve segments.

    Figure 14.  Comparison between the curves with the highest and lowest GCV error.

    Figure 15 illustrates how the beta curve enhances specific segments, particularly those with extreme values, by adjusting the shape parameters on those segments. However, the figure reveals discontinuities at each joint between curve segments. Consequently, an improved curve is presented in Figure 16, which represents the data effectively while maintaining geometric continuity. This improvement utilizes the restricted form of quintic Hermite interpolation introduced by [37], which allows distinct shape parameters to be defined at each joint without compromising geometric continuity. The method assigns new parameters, α1 and α2, in place of β1 and β2 as the shape parameters at the joint between two curve segments. The beta basis segments are constructed so that β1 and β2 are functions of the knot value, interpolating smoothly between α1 and α2 at the ends of each segment. Consequently, the curve maintains G2 continuity, ensuring smooth transitions without abrupt changes in direction or curvature at the segment joints. This approach guarantees that the shape parameters transition smoothly from one segment to the next, preserving the overall smoothness of the curve while allowing specific adjustments at the joints.

    Figure 15.  Improved beta curve segments with adjusted shape parameter values.
    Figure 16.  Improved beta curve segments with adjusted shape parameter values while maintaining geometric continuity.
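    The smooth parameter interpolation described above can be illustrated with the standard quintic Hermite blend, whose first and second derivatives vanish at both ends. The exact restricted form used in [37] may differ; this is only an illustrative sketch:

```python
def smooth_blend(u):
    """Quintic Hermite blend h(u) = 6u^5 - 15u^4 + 10u^3 with
    h(0)=0, h(1)=1 and zero 1st/2nd derivatives at u=0 and u=1."""
    return u ** 3 * (10 - 15 * u + 6 * u * u)

def beta_on_segment(u, alpha1, alpha2):
    """Shape parameter varying smoothly from alpha1 (at u=0) to alpha2
    (at u=1), so adjacent segments sharing a joint value connect smoothly."""
    return alpha1 + (alpha2 - alpha1) * smooth_blend(u)
```

    Because the blend is flat to second order at both ends, the interpolated shape parameter matches the joint values α1 and α2 without introducing jumps in direction or curvature at the segment joints.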

    Each segment of the curve in Figure 16 is adjusted locally by controlling the values of α1 and α2. The legend of the graph shows the value of each α used to control each of the 29 curve segments locally. Additionally, a curve using a single global pair of shape parameter values, β1=2 and β2=1 as optimized by GCV, is plotted in the figure as the black line; it shows minimal differences compared with the locally adjusted segments. This demonstrates the competency of the proposed method for data smoothing, highlighting its flexibility and efficiency in automatically smoothing curves for large datasets by finding global parameter values, which can then be further enhanced by locally adjusting the shape parameters on particular segments.

    Even though knot insertion and deletion can be performed with standard cubic B-spline fitting to gain local control of the curve, this technique has several disadvantages: knot insertion produces additional control points and requires knot values to be selected individually, making the process tedious and time-consuming. Beta splines are therefore more efficient in this context. Given that this study does not aim to model overfitted or underfitted curves, finding an optimal solution that yields a smoother curve is crucial. A possible combination is β1=2 and β2=1, corresponding to Figure 17, where a good midpoint is reached. Looking at the grid in Figure 11, increasing the value of β1 decreases the GCV error, while higher values of β2 increase it. The near-optimal region corresponds to the light pink and light blue shades of the grid, which show that β1=2 with β2 from 0 to 4 yields almost the same GCV error and curves. Following the minimum complexity principle, β1=2 and β2=1 are selected and then applied to smooth the temperature data of all northern stations.

    Figure 17.  Temperature curve with optimal GCV error.

    To evaluate the effectiveness and accuracy of the proposed method, a comparison was conducted with standard smoothing techniques, namely the moving average, simple exponential smoothing (SES), and double exponential smoothing (DES). Figure 18 presents the results for SES with α=0.5793, DES with α=0.5573 and β=0.0001, and a moving average of order 4. These parameter values were estimated automatically by nonlinear minimization over the observed data using the built-in R function, which is more robust and objective than choosing parameters from subjective experience. The figure clearly demonstrates that the proposed method closely follows the shape of the data's control polygon, whereas the moving average, SES, and DES methods exhibit some lag in capturing the data's pattern. This comparison highlights the superior performance of the proposed method in accurately reflecting the underlying trends and variations in the data compared with traditional smoothing techniques.

    Figure 18.  Comparison of the proposed and standard smoothing methods.
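    The recursions behind SES and DES (Holt's linear method) are simple enough to state directly. The sketch below is illustrative Python with hypothetical inputs, not the R routine used in the study:

```python
def ses(x, alpha):
    """Simple exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    s = [x[0]]
    for v in x[1:]:
        s.append(alpha * v + (1 - alpha) * s[-1])
    return s

def des(x, alpha, beta):
    """Double exponential smoothing (Holt): a level recursion plus a
    separate trend recursion, so trending data is tracked with less lag."""
    level, trend = x[0], x[1] - x[0]
    out = [level]
    for v in x[1:]:
        new_level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
        out.append(level)
    return out
```

    With β close to zero, as estimated above, the DES trend component updates only very slowly, so DES behaves like SES applied around a nearly fixed trend.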

    Another comparison of smoothed curves, using the proposed method, B-splines, and the Fourier basis, is presented in Figure 19. The smooth fits in the figure were achieved with 13 Fourier basis functions, a B-spline basis with repeated knots, and a beta spline basis with the optimal parameters β1=2 and β2=1. For the Fourier series, the number of basis functions was determined by adding basis functions until the estimated variance ceased to decrease significantly, following the guidelines of [14]. Figure 20 shows how the variance estimate levels off by the time 13 Fourier basis functions are used for smoothing the temperature data. Although lower estimated variances were found in some instances, they were not selected, to avoid overfitting the data. The figure demonstrates that the proposed method produced a superior curve compared with the other methods, effectively capturing the data pattern without overfitting unnecessary details. The beta spline's ability to adjust the curve by manipulating shape parameter values provides flexibility, surpassing methods that require adjusting the number of basis functions or knot values to achieve the desired curve.

    Figure 19.  Comparison of the proposed and existing basis functions.
    Figure 20.  The relation between the number of Fourier basis functions and the unbiased estimate of the residual variance fitting the temperature data.
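    The basis-selection rule described above (adding Fourier basis functions until the unbiased residual variance stops decreasing) can be sketched as follows. The data here are synthetic and illustrative, not the station temperatures:

```python
import numpy as np

def fourier_basis(t, k, period):
    """First k Fourier basis functions: a constant plus sine/cosine pairs."""
    cols = [np.ones_like(t, dtype=float)]
    for j in range(1, (k - 1) // 2 + 1):
        w = 2 * np.pi * j / period
        cols += [np.sin(w * t), np.cos(w * t)]
    return np.column_stack(cols[:k])

def residual_variance(t, y, k, period):
    """Unbiased residual variance estimate RSS / (n - k) after a
    least-squares fit with k Fourier basis functions."""
    B = fourier_basis(t, k, period)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return np.sum((y - B @ coef) ** 2) / (len(y) - B.shape[1])

# Increase k and watch the variance estimate level off.
rng = np.random.default_rng(1)
t = np.arange(60, dtype=float)
y = np.sin(2 * np.pi * t / 60) + 0.2 * rng.standard_normal(60)
variances = {k: residual_variance(t, y, k, period=60) for k in (1, 3, 5, 7)}
```

    Once the dominant periodic components are captured, adding further basis functions no longer reduces the variance estimate noticeably, which is the stopping point used for the 13-function fit above.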

    Despite periodicity being a typical characteristic of weather data, for which the Fourier basis is often chosen, the smoothed curves using B-spline and beta spline yielded better results than the Fourier basis, as demonstrated in Figure 19. This outcome aligns with the findings of [27], where Fourier basis functions were initially used for data smoothing due to the periodic nature of air temperature series. However, B-splines yielded slightly better forecast results because of their ability to balance model flexibility and overfitting, capturing complex patterns within the data to ensure accurate forecasts. Similar methods were observed in the studies by [25,40,41], which also employed B-splines for smoothing weather data instead of Fourier basis functions.

    Figure 21 shows temperature data from the main meteorological stations in northern Peninsular Malaysia. The curves were smoothed using a cubic beta spline with β1=2 and β2=1, and the thick black line indicates the mean temperature of the northern region of Malaysia. The GCV method successfully determined the optimal parameter values, as the plots show precisely fitted curves that offer the most faithful representation of the temperature data from each station. Identifying the optimal parameters is essential because they directly influence the accuracy of the fitted curve and thereby improve downstream analysis such as forecasting. The proposed method can be extended to forecasting using various approaches, including functional regression models.

    Figure 21.  Temperature curves at the north region of Malaysia.

    Data smoothing is performed as a preliminary step to prepare the data for forecasting. For example, in the nonparametric FDA model, the continuous or functional term is considered a predictor variable. The smoothed observations generated through the smoothing process are used within the FDA framework for regression modeling to make advanced forecasts. Beyond regression, multiple forecasting methods have been proposed within the FDA framework and can be considered in future research. By smoothing the data and constructing functions, the model can better capture underlying patterns and relationships, leading to more accurate forecasts. A flexible smoothing technique avoids underfitted or overfitted curves, which either fail to capture essential details of the data or include excessive unnecessary information.

    This study presents a novel approach to data smoothing within the FDA framework to improve forecasting. Unlike the commonly used bases, such as the Fourier or B-spline basis, a beta spline, a spline with two shape parameters, was employed to transform discrete data into functional form. This methodology, previously unexplored in the FDA framework and climate studies, offers several advantages in controlling the curve shape by manipulating these parameters. Experiments demonstrated that the beta spline's flexibility is particularly effective in capturing intricate details in complex climate data, including extreme events. Additionally, an enhanced optimization methodology based on generalized cross-validation was developed to determine the optimal combination of shape parameters for data representation. By manipulating these parameters, the proposed method offers a more flexible approach to data smoothing. The GCV color grid facilitates efficient identification of the best parameter combination, preventing overfitting or underfitting through the error-value color indicator. This technique is expected to perform effectively on various types of time series data.

    Wan Anis Farhah Wan Amir: Conceptualization, Formal Analysis, Investigation, Methodology, Software, Visualisation, Writing - original draft; Md Yushalify Misro: Data Curation, Funding acquisition, Project administration, Supervision, Writing - review & editing; Mohd Hafiz Mohd: Supervision, Writing - review & editing. All authors have read and approved the final version of the manuscript for publication.

    The authors declare that they have not used Artificial Intelligence tools in the creation of this article.

    This research was supported by the Ministry of Higher Education Malaysia through the Fundamental Grant Scheme (FRGS/1/2023/STG06/USM/03/4) and the School of Mathematical Sciences, Universiti Sains Malaysia. The authors express their sincere gratitude to the Malaysian Meteorological Department for providing the data used in this study.

    The authors declare that they have no conflicts of interest.


