Research article

Malware detection based on semi-supervised learning with malware visualization

  • Received: 06 April 2021 Accepted: 24 May 2021 Published: 02 July 2021
  • Traditional signature-based detection requires detailed manual analysis to extract the signatures of malicious samples and a large amount of manual labeling to maintain the signature library, which incurs great time and resource costs and makes it difficult to keep up with the rapid generation and mutation of malware. Methods based on traditional machine learning likewise spend considerable time and resources on sample labeling, so that abundant unlabeled samples cannot be used directly. In view of these issues, this paper proposes an effective malware classification framework based on malware visualization and semi-supervised learning. The framework consists of three main parts: malware visualization, feature extraction, and the classification algorithm. First, binary files are processed directly through visual methods, without disassembly, decompression, or decryption. Then, the global and local features of the grayscale image are extracted and fused by a dedicated feature fusion method to eliminate the exclusion between different feature variables. Finally, an improved collaborative learning algorithm is proposed that continuously trains and optimizes the classifiers by introducing features of inexpensive unlabeled samples. The proposed framework was evaluated on two extensively studied benchmark datasets, i.e., Malimg and Microsoft. The results show that, compared with traditional machine learning algorithms, the improved collaborative learning algorithm not only reduces the cost of sample labeling but also continuously improves model performance through the input of unlabeled samples, thereby achieving higher classification accuracy.

    Citation: Tan Gao, Lan Zhao, Xudong Li, Wen Chen. Malware detection based on semi-supervised learning with malware visualization[J]. Mathematical Biosciences and Engineering, 2021, 18(5): 5995-6011. doi: 10.3934/mbe.2021300




    Malware has become a major threat to network security [1]. Traditional signature-based methods extract binary signatures from malware to build a huge feature library, which provides comprehensive information about malicious samples but requires much time and effort [2]. Meanwhile, the enormous number of malware variants has also brought great challenges to signature-based detection methods.

    In recent years, many new malware detection methods have been proposed, ranging from multi-signature methods to static analysis, dynamic detection, and heuristic detection. However, anti-detection technology has also improved constantly: malware can change its features through object-code obfuscation, code refactoring, and other techniques. According to reports by Symantec and McAfee, approximately 69 new instances of malware are generated per minute, and more than 50% of them are variants of existing malware [3]. Traditional feature extraction methods cannot afford the huge cost of manually labeling new samples. These new variants usually have the same malicious intentions and characteristics as the original malware [4,5]. Such a group of malware samples with similar attacking patterns is called a malware family. Recognizing malware families relies on quickly analyzing the behaviors and functions of malware.

    In the face of these new challenges, some researchers have tried to explore malware features using machine learning technologies [4,5,6]. Nataraj et al. proposed categorizing malware families by visualizing malware [7]. The method not only shows the visual similarity between different samples in the same family but also adapts to common code obfuscation techniques. Accordingly, neural networks have also been utilized to analyze visualized malware and have achieved promising results [8,9]. However, due to the complexity of neural networks, the huge cost of the training process makes it hard to keep up with the rapid growth of malware variants [10,11,12]. Besides, most of these neural networks need supervised learning, requiring human experts and special tools to abstract the malware features and labels of new samples, which is an extremely expensive and inefficient process [13].

    To address these problems, we propose a malware detection model based on malware visualization and collaborative learning. In the model, the malware binaries are first visualized as grayscale images, and the malware family features are then automatically extracted from the images using image feature extractors such as the LBP extractor. Finally, the features are fed to a cooperative learning model with multiple classifiers to recognize malware. Furthermore, noise learning theory is incorporated into the training process to exclude noise in the unlabeled samples, ensuring that the model's misclassification rate can be continuously reduced.

    The contributions in this paper are summarized as follows:

    ● We proposed a malware classification framework that integrates malware visualization, automatic feature extraction, and collaborative learning. The framework directly processes malware binaries without disassembly, decompression, or decryption.

    ● This framework continuously improves the classification ability of the model through the introduction of unlabeled samples, which alleviates the lack of labeled malicious samples in real-world scenarios.

    ● Comparative experiments were conducted on two widely studied imbalanced benchmark datasets, Malimg and Microsoft. Experimental results show that the proposed framework achieves excellent classification performance, with accuracies of 0.98 and 0.94, respectively. Compared with state-of-the-art methods, our method is more resistant to the effects of data imbalance.

    With the development of machine learning and visualization technology, researchers have begun to draw on visualization ideas from the fields of computer forensics and network monitoring to visualize and classify malware [7]. First, the raw binary data of the malware is converted into a grayscale image, in which each byte is represented as one grayscale pixel. These byte sequences are then combined into a 2D array, and feature extraction is performed through texture analysis to convert malware detection into an image classification task. In this way, not only can the characteristic information of the software be visualized, but the detection efficiency is also improved compared with traditional methods [14,15,16]. Furthermore, in contrast with traditional static analysis methods, malware visualization is better suited to malicious samples protected by obfuscation technology [17].

    Recently, numerous malware detection methods based on visualization have been proposed [18,19]. However, these methods still have some shortcomings, including the lack of available labeled samples in real applications and the great gap between feature extraction algorithms and visual feedback [20]. Therefore, we apply a semi-supervised learning algorithm to alleviate the issue of insufficient labeled samples and to continuously improve the classification performance through the utilization of unlabeled samples together with noise learning theory.

    Supervised algorithms have achieved promising performance in malware detection; however, they rely on plenty of labeled samples for training, which is difficult to satisfy in real applications [20,21]. On the other hand, unsupervised learning can employ unlabeled samples for training but often attains lower accuracy [22]. Compared with these two types of classification algorithms, a semi-supervised learning algorithm needs only a small number of labeled samples in the training stage and can continuously enhance detection performance through the use of a large number of unlabeled samples [23,24]. Consequently, semi-supervised learning is more suitable for malware detection applications.

    The semi-supervised co-training algorithm [25] assumes that there are two redundant and independent feature views of the data. At the initial training stage, some labeled samples are submitted to two basic classifiers in different feature views. After initial training, unlabeled samples with high labeling confidence are selected, and these "pseudo-labeled" samples are put into the updating set for further training. Through this process of "learning from each other and making progress together", the classifiers are iteratively updated in each training round until their performance is stable. However, the conditional independence of the two feature views is difficult to satisfy. S. Goldman and Y. Zhou proposed improving the classifiers through collaborative learning [26]. Although this method removed the requirement of redundant feature views, it still restricts the types of base classifiers, and the repeated ten-fold cross-validation in the updating process results in an overwhelming cost. In response to this problem, Zhi-Hua Zhou et al. proposed the Tri-training algorithm, which neither requires sufficient redundant views nor restricts the type of classifiers [27]. The algorithm easily handles the problems of labeling-confidence estimation and predictive classification of unknown samples by using three collaborative classifiers.

    In this paper, the idea of noise learning theory [28] is utilized to improve collaborative learning based on the original Tri-training algorithm. In each iteration, a portion of the "pseudo-labeled" samples is extracted for error-rate calculation and threshold evaluation to reduce the label error rate of the "pseudo-labels". A detailed description of the model is given in Section 3.

    The work of this paper focuses on the detection of malware. Figure 1 shows the architecture of our malware classification model, which mainly consists of three major components: malware visualization, feature extraction, and the tri-training classification algorithm.

    Figure 1.  The overall process of malware detection for collaborative learning.

    As shown in the first stage of Figure 1, malware visualization transforms the binary code into an image carrying certain characteristic information [7,14]. For a given binary file, every 8 bits are read as one unsigned integer, and the resulting values are reorganized into a two-dimensional matrix. Each value in the matrix is taken as the gray value of the generated image in the range [0,255], where 0 and 255 represent black and white, respectively. Figure 2 shows the visualization of a binary malware file as a grayscale image.

    Figure 2.  Process of malware visualization.

    The images converted from malware have fixed widths and varying heights. The image width must be chosen carefully for files of different sizes, so that the generated images are neither extremely tall nor too wide, which would degrade the performance of feature extraction [7]. Table 1 gives the recommended image widths for malicious samples.

    Table 1.  Image width for various sizes.
    File size Optional width
    < 10KB 32
    10KB–30KB 64
    1000KB–2000KB 1024
    2000KB–4000KB 1280
    4000KB–8000KB 1536
    8000KB–10MB 1792
    10MB–15MB 2048
    15MB–20MB 2560
    20MB–25MB 3072
    25MB–30MB 4096
    > 30MB 5120
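
    To make the conversion step concrete, the following minimal Python sketch (our own illustration, not the authors' code) reads a binary file, maps each byte to one grayscale pixel, and selects the image width from the file size following Table 1. The function names and the abbreviated width table are purely illustrative.

    import numpy as np
    from PIL import Image

    # Width lookup adapted from Table 1 (only a few brackets shown; extend as needed).
    WIDTH_TABLE = [
        (10 * 1024, 32),           # < 10 KB  -> width 32
        (30 * 1024, 64),           # 10-30 KB -> width 64
        (2000 * 1024, 1024),       # ... intermediate brackets omitted, see Table 1
        (30 * 1024 * 1024, 4096),  # 25-30 MB -> width 4096
    ]

    def pick_width(n_bytes: int) -> int:
        for limit, width in WIDTH_TABLE:
            if n_bytes < limit:
                return width
        return 5120                # > 30 MB

    def binary_to_grayscale(path: str) -> Image.Image:
        """Read a raw binary, interpret each byte as a pixel in [0, 255],
        and reshape the byte stream into a 2-D grayscale image."""
        data = np.fromfile(path, dtype=np.uint8)
        width = pick_width(data.size)
        height = data.size // width            # drop the trailing partial row
        img = data[: height * width].reshape(height, width)
        return Image.fromarray(img, mode="L")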


    Taking code obfuscation technologies such as fragment encryption and instruction substitution into consideration, the texture features of the generated grayscale image may be disturbed and deformed. Hence, to ensure that the image features used in training are stable and robust, both the local texture features and the global features of the images are extracted and fused through the canonical correlation analysis (CCA) method [29].

    Before choosing a local feature extractor, the SIFT feature [30], the HOG feature [31], and the LBP feature [32] are analyzed and compared for the grayscale images of malicious samples [33]. SIFT features are invariant to rotation, brightness, and scale, but demand enormous computation. Since gamma correction is performed as part of the grayscale processing, HOG features can reduce the negative impact of local shadows and illumination changes. Nevertheless, HOG extraction has some defects, such as a lengthy generation process and sensitivity to noise, which lead to high costs. Thus, a more stable and efficient feature extraction method is required.

    Many malicious samples are derived from classic malware. Owing to the structural similarity within a malware family, the pixels of the converted grayscale images keep similar proportions overall or in some local areas [7,20]. The LBP feature depends only on the relative magnitude of the central pixel and its neighborhood, so the code remains unchanged even if the overall gray level changes simultaneously [34,35]. Furthermore, the LBP descriptor adapts well to image rotation. Owing to these properties, the LBP method is highly adaptable and robust for detecting different groups of malicious samples.

    Therefore, LBP is utilized to extract the local features. It employs a circular descriptor window; for a neighborhood of $P$ pixels $\{g_0, g_1, \ldots, g_{P-1}\}$ with radius $R$ around a central pixel $g_c$, the LBP encoding is:

    $$LBP_{P,R}=\sum_{p=0}^{P-1}s(g_p-g_c)\,2^{p},\qquad s(x)=\begin{cases}1, & x\ge 0\\ 0, & x<0\end{cases}\tag{1}$$

    where $g_c$ and $g_p$ represent the gray values of the central pixel and the circular-neighborhood pixels, respectively, and $R$ is the radius of the neighborhood. For such a circular LBP operator, the relative position of $g_p$ with respect to the center pixel $g_c$ changes with image rotation, resulting in different LBP values. Consequently, we adopt the rotation-invariant uniform LBP operator, which adapts to image rotation and is resistant to the noise caused by the large number of patterns generated in circular neighborhoods of different sizes, defined by:

    $$LBP_{P,R}^{riu2}=\begin{cases}\sum_{p=0}^{P-1}s(g_p-g_c), & \text{if }U(LBP_{P,R})\le 2\\ P+1, & \text{otherwise}\end{cases}\tag{2}$$

    where $s(x)=\begin{cases}1, & x\ge 0\\ 0, & x<0\end{cases}$ and $U$ counts the number of 0/1 transitions between adjacent binary bits of the LBP pattern (its uniformity), expressed as:

    $$U(LBP_{P,R})=\sum_{p=1}^{P}\left|s(g_p-g_c)-s(g_{p-1}-g_c)\right|\tag{3}$$
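
    As an illustration, assuming scikit-image is available (the paper does not name a specific library), the rotation-invariant uniform LBP of Eq. (2) can be computed and summarized as a normalized histogram feature vector:

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(gray: np.ndarray, P: int = 8, R: float = 1.0) -> np.ndarray:
        """Rotation-invariant uniform LBP (Eq. (2)) followed by a normalized
        histogram over the P + 2 possible codes."""
        codes = local_binary_pattern(gray, P, R, method="uniform")  # 'uniform' = riu2 mapping
        hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
        return hist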

    The LBP feature is stable and resistant to noise for regional description, but it focuses on local patterns and lacks a global description of the image. Therefore, after extracting the LBP features of a malicious sample image, a global feature of the image is also extracted, with noise reduced through Gaussian blurring. The image is divided into grids of size 16 × 16, and the global feature consists of the mean and variance of pixel intensity in each grid cell, where the two-dimensional Gaussian function shown in (4) is employed to weight each pixel.

    $$G(x,y)=\frac{1}{2\pi\sigma^{2}}e^{-(x^{2}+y^{2})/2\sigma^{2}}\tag{4}$$

    where $\sigma$ is the standard deviation of the Gaussian kernel.
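
    A minimal sketch of this grid-based global descriptor is given below, under the assumption that "grids of size 16 × 16" means 16 × 16-pixel cells and that the Gaussian weights of Eq. (4) are normalized within each cell; these details are left open by the text, and images are assumed to be resized to a common size beforehand so that feature lengths match.

    import numpy as np

    def gaussian_weights(h: int, w: int, sigma: float = 4.0) -> np.ndarray:
        """2-D Gaussian weights centered on the cell, Eq. (4)."""
        y, x = np.mgrid[0:h, 0:w]
        y = y - (h - 1) / 2.0
        x = x - (w - 1) / 2.0
        g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
        return g / g.sum()                 # normalize so the weights sum to 1

    def grid_global_feature(gray: np.ndarray, cell: int = 16) -> np.ndarray:
        """Weighted mean and variance of pixel intensity for every 16x16 cell."""
        feats = []
        h, w = gray.shape
        wgt = gaussian_weights(cell, cell)
        for r in range(0, h - h % cell, cell):
            for c in range(0, w - w % cell, cell):
                block = gray[r:r + cell, c:c + cell].astype(float)
                mean = (wgt * block).sum()
                var = (wgt * (block - mean) ** 2).sum()
                feats.extend([mean, var])
        return np.asarray(feats)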

    To enhance the correlation between features and images, canonical correlation analysis (CCA) is applied to fuse the image features after the global and local features have been extracted. CCA is an effective multi-view data processing method [36] that can mine the potential associations between two sets of variables to obtain a more representative representation. The general idea of CCA is to find linear combinations of the two groups of variables that maximize their correlation coefficient through matrix computations, thereby establishing the relationship between the two groups [37]. The CCA fusion of the local features X and the global features Y produces a structure containing D null-hypothesis (H0) tests and related statistics, where D is the minimum of the ranks of the two feature matrices: D = min(rank(X), rank(Y)). The relevant fusion statistics are shown in Table 2.

    Table 2.  Statistics on CCA fusion methods.
    Field Description
    Wilks Wilks' Lambda
    chisq Approximate chi-squared statistic for H0
    pChisq Right-tail significance level for chisq
    F Approximate F statistic for H0
    pF Right-tail significance level for F
    df1 Numerator degrees of freedom for F
    df2 Denominator degrees of freedom for F
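
    For illustration, a hedged sketch of the fusion step using scikit-learn's CCA is given below. The paper does not specify the exact fusion rule for the canonical projections, so concatenating the two projected views is assumed here (summing them is an equally common alternative), with D = min(rank(X), rank(Y)) as in the text.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def cca_fuse(X_local: np.ndarray, Y_global: np.ndarray) -> np.ndarray:
        """Project both feature sets onto their canonical components and fuse them.

        D = min(rank(X), rank(Y)) as described in the text; concatenating the
        projected views is an assumed fusion choice, not the authors' exact rule."""
        D = min(np.linalg.matrix_rank(X_local), np.linalg.matrix_rank(Y_global))
        cca = CCA(n_components=D)
        Xc, Yc = cca.fit_transform(X_local, Y_global)
        return np.hstack([Xc, Yc])          # fused, lower-dimensional representation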


    The Tri-training algorithm contains three basic classifiers $C_i$ ($i\in\{1,2,3\}$). A small number of labeled samples L along with a large number of unlabeled samples U are used in the training process to carry out co-training [25] of the basic classifiers.

    For the three classifiers $C_i, C_j, C_k$ (where $i,j,k\in\{1,2,3\}$ and $j,k\ne i$) in each round of co-training, $C_j$ and $C_k$ make predictions on samples randomly selected from the unlabeled set U; the samples $\{x \mid predict_k(x)=predict_j(x)\}$, labeled identically by $C_j$ and $C_k$, are considered to have high labeling confidence and are put as new labeled data into the updating set $L_i$ of classifier $C_i$. $L_i$, together with the known labeled sample set L, is then used to update $C_i$. After each round of updating, the newly labeled samples are put back into U and a new round of iterative training starts, until none of the classifiers changes. As shown in formula (5), after the iterative training the class of a sample is predicted through a voting mechanism.

    $$h(x)=\arg\max_{y\in label}\ \sum_{i:\,h_i(x)=y}1\tag{5}$$

    If the prediction of $C_j$ and $C_k$ on x is correct, $C_i$ gets a valid new example for further training; otherwise, $C_i$ gets a noisy example with an incorrect label. According to the noise learning theory of Angluin and Laird [28], for a training sequence $\gamma_i$ containing m samples, if the sample size m satisfies:

    $$m\ge\frac{2}{\epsilon^{2}(1-2\eta)^{2}}\ln\!\left(\frac{2N}{\delta}\right)\tag{6}$$

    where $\epsilon$ is the upper limit of the classifier error, $\eta$ is the classification noise rate of the training set, N is the number of categories, and $\delta$ is the confidence parameter, then the PAC (probably approximately correct) guarantee for the true classification $h^{*}$ is:

    $$\Pr\!\left[d(h_j,h^{*})\ge\epsilon\right]\le\delta\tag{7}$$

    where $d(h_j,h^{*})$ represents the sum of the probabilities of the elements of the symmetric difference between the hypothesis (classifier) $h_j$ and the true classification $h^{*}$. Given the confidence parameter $\delta$ and the upper limit of classification error $\epsilon$, formula (6) can be rewritten as formula (8):

    $$(1-2\eta)^{2}>\frac{2}{\epsilon^{2}m}\ln\!\left(\frac{2N}{\delta}\right)\tag{8}$$

    Expanding the left side of formula (8), we get:

    $$1-\frac{2}{\epsilon^{2}m}\ln\!\left(\frac{2N}{\delta}\right)>4(\eta-\eta^{2})\approx4\eta\tag{9}$$

    Thus

    $$1-\frac{2}{\epsilon^{2}m}\ln\!\left(\frac{2N}{\delta}\right)\ge4\eta\tag{10}$$

    That is, for the given confidence parameter $\delta$ and the upper limit of classification error $\epsilon$, we need to guarantee that:

    $$\eta\le\frac{1}{4}\left(1-\frac{2}{\epsilon^{2}m}\ln\!\left(\frac{2N}{\delta}\right)\right)\tag{11}$$

    To simplify the calculation of equation (10), let $c=2\mu\ln\!\left(\frac{2N}{\delta}\right)$, where $\mu$ is the coefficient that turns inequality (11) into an equality; then we get:

    $$\frac{c}{\epsilon^{2}}=m(1-4\eta)\tag{12}$$

    It can be seen from formula (12) that the squared error bound $\epsilon^{2}$ is inversely proportional to $m(1-4\eta)$. Samples temporarily labeled by two of the classifiers in each round are called pseudo-labeled samples. Since the number of unlabeled examples selected in each round of tri-training is not fixed, let $L_u^{t}$ be the pseudo-labeled sample set of the $t$-th round and $\eta^{t}$ be the detection error rate; the sample size of this round is then $M^{t}=|L|+|L_u^{t}|$. For the training result of this round to improve on the previous round, condition (13) must be satisfied:

    $$\left(|L|+|L_u^{t}|\right)\left(1-4\eta^{t}\right)>\left(|L|+|L_u^{t-1}|\right)\left(1-4\eta^{t-1}\right)\tag{13}$$

    That is, updating the classifier with the $|L_u^{t}|$ unlabeled samples introduced in this round continues to improve the classification performance, because the error bound $\epsilon$ of the $t$-th round is lower than that of the $(t-1)$-th round; otherwise, the newly labeled samples of this round are abandoned and samples are re-drawn from U for training. For equation (13), the sample detection error rate is $\eta^{t}=\frac{e_L^{t}|L|+e_u^{t}|L_u^{t}|}{|L|+|L_u^{t}|}$, where $e_L^{t}$ and $e_u^{t}$ represent the error rates of the classifier on the labeled sample set L and on the pseudo-labeled set $L_u^{t}$ during the $t$-th round of training, respectively. To ensure that the training process continuously reduces the upper limit of classification error $\epsilon$, substituting the expression for $\eta^{t}$ into equation (13) gives:

    $$|L_u^{t}|-4\left(e_L^{t}|L|+e_u^{t}|L_u^{t}|\right)>|L_u^{t-1}|-4\left(e_L^{t-1}|L|+e_u^{t-1}|L_u^{t-1}|\right)\tag{14}$$

    In most cases, the classification error rate of the model on the labeled samples satisfies $e_L^{t}\ll e_u^{t}$, so $e_L^{t}$ can be ignored and equation (14) can be simplified to (15).

    $$|L_u^{t}|\left(1-4e_u^{t}\right)>|L_u^{t-1}|\left(1-4e_u^{t-1}\right),\quad\text{that is}\quad\frac{|L_u^{t}|}{|L_u^{t-1}|}>\frac{1-4e_u^{t-1}}{1-4e_u^{t}}\tag{15}$$

    Therefore, in the tri-training process of the classifiers $C_i, C_j, C_k$ ($i,j,k\in\{1,2,3\}$, $j,k\ne i$), for two adjacent rounds of training, when the size and error rate of the new labeled samples satisfy equation (15), the samples labeled identically by $C_j$ and $C_k$ are submitted to $C_i$ for updating; otherwise, the newly labeled samples of this round are discarded and unlabeled samples are re-selected from U for training. Since the classification error rate $e_u^{t}$ of the "pseudo-labeled" samples cannot be calculated directly, this paper adopts the idea of ten-fold cross-validation: in each round of the iterative training, 1/10 of the labeled samples are randomly selected from L as a test set to estimate $e_u^{t}$, and the remaining samples in L are combined with U for training. The tri-training procedure with noise judgment is listed in Table 3.

    Table 3.  Improved Tri-training algorithm based on noise learning theory.
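
    The following condensed Python sketch illustrates one way to implement the improved tri-training loop described above, with the acceptance test of Eq. (15) and a held-out tenth of L used to estimate the round's error rate. The scikit-learn base learners mirror those used in the experiments (Section 4), but the code is a simplified reconstruction under stated assumptions, not a reproduction of the algorithm in Table 3 or the authors' implementation.

    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def tri_training(X_l, y_l, X_u, max_rounds=20):
        """Simplified improved tri-training: X_l/y_l are the labeled set L
        (integer-encoded labels), X_u is the unlabeled pool U; pseudo-labels
        are accepted only when the condition of Eq. (15) holds for the round."""
        learners = [RandomForestClassifier(),
                    KNeighborsClassifier(),
                    LogisticRegression(max_iter=1000)]
        for clf in learners:
            clf.fit(X_l, y_l)

        prev_e = [0.5] * 3          # e_u^{t-1}, initialised pessimistically
        prev_n = [0] * 3            # |L_u^{t-1}|

        for _ in range(max_rounds):
            updated = False
            for i in range(3):
                j, k = [m for m in range(3) if m != i]
                # Hold out 1/10 of L to estimate the error rate (proxy for e_u^t).
                X_tr, X_val, y_tr, y_val = train_test_split(X_l, y_l, test_size=0.1)
                p_j, p_k = learners[j].predict(X_u), learners[k].predict(X_u)
                agree = p_j == p_k                    # high-confidence pseudo-labels
                e_t = 1.0 - (learners[j].predict(X_val) == y_val).mean()
                n_t = int(agree.sum())
                # Acceptance test derived from Eq. (15).
                if n_t > 0 and e_t < 0.25 and \
                        n_t * (1 - 4 * e_t) > prev_n[i] * (1 - 4 * prev_e[i]):
                    X_new = np.vstack([X_tr, X_u[agree]])
                    y_new = np.concatenate([y_tr, p_j[agree]])
                    learners[i] = clone(learners[i]).fit(X_new, y_new)
                    prev_e[i], prev_n[i] = e_t, n_t
                    updated = True
            if not updated:          # no classifier changed: training has stabilised
                break
        return learners

    def predict_vote(learners, X):
        """Majority vote of the three classifiers, Eq. (5)."""
        preds = np.stack([clf.predict(X) for clf in learners])
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)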


    To evaluate the performance of the proposed method, we carried out a group of comparisons on the Malimg dataset [7] and the Microsoft Malware Classification Challenge dataset [38] (referred to as the Microsoft dataset). The Malimg and Microsoft datasets are described in Tables 4 and 5, respectively.

    Table 4.  Sample distribution of malware for the Malimg dataset.
    Class ID Class Family Sample size Population
    1 Adialer.C Dialer 122 1.3%
    2 Agent.FYI Backdoor 116 1.2%
    3 Allaple.A Worm 2949 31.6%
    4 Allaple.L Worm 1591 17.0%
    5 Alueron.gen!J Trojan 198 2.1%
    6 Autorun.K Worm: AutoIT 106 1.1%
    7 C2Lop.gen!G Trojan 200 2.1%
    8 C2Lop.P Trojan 146 1.6%
    9 Dialplatform.B Dialer 177 1.9%
    10 Dontovo.A Trojan downloader 162 1.7%
    11 Fakerean Rogue 381 4.1%
    12 Instantaccess Dialer 431 4.6%
    13 Lolyda.AA 1 PWS 213 2.3%
    14 Lolyda.AA 2 PWS 184 2.0%
    15 Lolyda.AA 3 PWS 123 1.3%
    16 Lolyda.AT PWS 159 1.7%
    17 Malex.gen!J Trojan 136 1.5%
    18 Obfuscator.AD Trojan downloader 142 1.5%
    19 Rbot!gen Backdoor 159 1.7%
    20 Skintrim.N Trojan 80 0.9%
    21 Swizzor.gen!E Trojan downloader 128 1.4%
    22 Swizzor.gen!I Trojan downloader 132 1.4%
    23 VB.AT Worm 408 4.4%
    24 Wintrim.BX Trojan downloader 97 1.0%
    25 Yuner.A Worm 800 8.6%

    Table 5.  Sample distribution of malware for the Microsoft dataset.
    Class ID Class Sample size Population
    1 Ramnit 1541 14.2%
    2 Lollipop 2478 22.8%
    3 Kelihos ver3 2942 27.1%
    4 Vundo 475 4.4%
    5 Simda 42 0.4%
    6 Tracur 751 6.9%
    7 Kelihos ver1 398 3.7%
    8 Obfuscator.ACY 1228 11.3%
    9 Gatak 1013 9.3%


    Tables 4 and 5 show that the class distributions of both the Malimg and Microsoft datasets are imbalanced, with very different proportions across sample groups. For example, in the Malimg dataset the two Allaple families alone account for 48.6% of the total across the 25 families, while the remaining 23 families account for only 51.4%; in the Microsoft dataset, the Simda family accounts for only 0.4% of the total across the nine families. For traditional supervised learning algorithms, such an imbalanced distribution of malicious samples often leads to overfitting and poor classification performance.

    Firstly, the LBP features and the average grid gray-intensity features of the malicious sample images in the Malimg dataset are extracted, and CCA fusion is then carried out on these two features. Finally, the fusion results are submitted to the co-learning model for classifier training. The classification results are shown in Figure 3. It can be seen that, as the proportion of training data increases, the results with CCA fusion are better, which is attributed to the CCA method suppressing the mutual exclusion between the two feature variables. By maximizing the correlation coefficient between the linearly projected one-dimensional representations of the two views, CCA obtains more discriminative features. Moreover, CCA fusion reduces the dimensionality of the data, which greatly lowers the training cost of co-learning. The time comparison between traditional serial fusion and CCA fusion for different sample sizes (in seconds) is shown in Table 6, from which we can see that CCA fusion dramatically reduces the time cost.

    Figure 3.  Accuracy for fusion methods.
    Table 6.  Time cost for fusion methods.
    Sample size Serial fusion CCA fusion
    4000 378.15 s 54.04 s
    5000 589.81 s 106.16 s
    6000 846.73 s 319.62 s


    Five alternative classifiers were considered for collaborative learning. To ensure sufficient divergence among the cooperating classifiers, we eventually selected random forest, KNN, and LR as the three base classifiers $C_1$, $C_2$ and $C_3$ for tri-training. In the experiment, the three classifiers are first initialized with differentiated sample characteristics, and then trained periodically by the cooperative learning algorithm. After training, different types of malicious samples are predicted on the test set and the results are compared with traditional machine learning classifiers.

    We analyzed the effect of the collaborative learning algorithm on the Malimg dataset, as shown in Figure 4. In the experiment, $C_1$, $C_2$ and $C_3$ represent the individually selected classical classifiers, and T represents the cooperative learning algorithm satisfying the noise learning theory. As shown in Figure 4, because of the small sample size at the beginning, the classification performance of all training algorithms is poor. With continuous learning from unlabeled samples, every classifier improves, while the performance of collaborative learning rises more noticeably; when the number of unlabeled samples reaches 3000, the accuracy of the fused classification exceeds 95%, and it keeps rising until the number reaches 3400. The results demonstrate that the collaborative learning algorithm can continuously improve classification accuracy through the continuous utilization of unlabeled samples.

    Figure 4.  Accuracy of each algorithm.

    However, for a dataset with imbalanced samples, the accuracy alone cannot comprehensively reflect the advantages of the proposed model. Consequently, in addition to the overall accuracy, the precision, recall, and F1-score of the model are also calculated for further evaluation. The precision P, recall R, and F1-score F of each class are computed first, and the F1 scores of all classes are then averaged to obtain the weighted-average F1 score. The calculation formulas are given in Equations (16)–(19).

    $$P_i=\frac{TP_i}{TP_i+FP_i}\tag{16}$$
    $$R_i=\frac{TP_i}{TP_i+FN_i}\tag{17}$$
    $$F_i=\frac{2\times P_i\times R_i}{P_i+R_i}\tag{18}$$

    and

    $$\text{weighted-averaged }F1=\frac{1}{n}\sum_{i=1}^{n}F_i\tag{19}$$

    where $TP_i$ is the number of samples correctly classified into the $i$-th category, $FP_i$ is the number of samples misclassified into the $i$-th category, and $FN_i$ is the number of samples belonging to the $i$-th class that are misclassified into other classes.
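
    For reference, the per-class metrics and the averaged F1 of Eqs. (16)–(19) can be computed directly from a confusion matrix; a small sketch assuming scikit-learn is available:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def per_class_prf(y_true, y_pred):
        """Per-class precision, recall, F1 (Eqs. (16)-(18)) and their mean (Eq. (19))."""
        cm = confusion_matrix(y_true, y_pred)
        tp = np.diag(cm).astype(float)
        fp = cm.sum(axis=0) - tp          # predicted as class i but actually another class
        fn = cm.sum(axis=1) - tp          # belonging to class i but predicted otherwise
        precision = tp / np.maximum(tp + fp, 1e-12)
        recall = tp / np.maximum(tp + fn, 1e-12)
        f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
        return precision, recall, f1, f1.mean()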

    The evaluation results on Malimg and Microsoft are shown in Tables 7 and 8. Our method reaches an F1-score of 97.23% on the Malimg dataset, while its accuracy on the Microsoft dataset reaches 94.09%. Therefore, compared with other classic classification algorithms, our method not only reduces the cost of labeled samples but also achieves better detection accuracy.

    Table 7.  Classification representation of different approaches on the Malimg dataset (%).
    Approach Accuracy Precision Recall F1-Score
    RandomForest 95.06 99.48 93.00 96.16
    SVM 93.38 100 95.37 96.57
    KNN 93.31 99.87 93.31 96.48
    LogisticRegression 92.69 98.39 92.68 95.43
    GBDT 93.37 100 93.57 96.57
    Our method 97.95 99.95 95.06 97.23

    Table 8.  Classification representation of different approaches on the Microsoft dataset (%).
    Approach Accuracy Precision Recall F1-Score
    RandomForest 92.72 99.69 92.74 96.08
    SVM 93.93 99.69 93.93 96.73
    KNN 93.78 99.37 92.66 96.50
    LogisticRegression 93.63 99.06 93.45 96.27
    GBDT 92.42 96.19 92.42 94.24
    Our method 94.09 100 94.09 96.96


    We propose a new malware classification model based on malware visualization and co-training of classifiers, and show that combining the malware visualization method with tri-training provides more discriminative patterns of malware families. In this framework, the malware is transformed into grayscale images by visualization; a CCA-based fusion method is then utilized to fuse the local and global features extracted from the grayscale images to reduce the time cost and improve feature relevance; finally, three basic classifiers are collaboratively trained under the tri-training scheme. In each round of collaborative learning, the newly labeled samples are filtered by noise learning theory, which ensures a continuous improvement of the overall co-learning performance and, by incorporating unlabeled data into the training process, alleviates the difficulty of obtaining labeled samples in practical applications.

    The advantages of our method are manifold. First, the experimental results show that the proposed method achieves good classification performance of 0.98 and 0.94 on the Malimg and Microsoft datasets, respectively. Second, our approach is more resistant to data imbalance. Third, the tri-training algorithm improves the classification ability of the model by introducing a large number of cheap unlabeled samples and reduces the noise impact caused by the lack of labeled samples. Although the accuracy of the collaborative learning algorithm improves after iterative training, the iterations increase the time overhead. In future work, the iterative updating efficiency of the collaborative learning algorithm needs to be further improved, for example by introducing more complex models that are better suited to image classification.


    The authors have no conflict of interest.



    [1] A. P. Namanya, A. Cullen, I. U. Awan, J. P. Disso, The world of Malware: An overview, in 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud), 2018, pp. 420-427, doi: 10.1109/FiCloud.2018.00067.
    [2] A. Martín, H. D. Menéndez, D. Camacho, MOCDroid: Multi-objective evolutionary classifier for Android malware detection, Soft. Comput., 21 (2017), 7405-7415. doi: 10.1007/s00500-016-2283-y
    [3] P. Foran, Of digital reliance, risk and resilience, Progres. Railro., 62 (2019), 30-30, 32-33.
    [4] K. Rieck, P. Trinius, C. Willems, T. Holz, Automatic analysis of malware behavior using machine learning, J. Comput. Secur., 19 (2011), 639-668. doi: 10.3233/JCS-2010-0410
    [5] F. Touchette, The evolution of malware, Network Security, 1 (2016), 11-14.
    [6] D. Gavrilut, M. Cimpoesu, A. Dan, L. Ciortuz, Malware detection using machine learning, Int. Multiconfer. Comput. Sci. Inform. Tech., 2010.
    [7] L. Nataraj, S. Karthikeyan, G. Jacob, B. S. Manjunath, Malware images: Visualization and automatic classification, Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec), 2011, Available from: https://doi.org/10.1145/2016904.2016908.
    [8] W. Huang, J. W. Stokes, MtNet: A multi-task neural network for dynamic Malware classification, International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 2016.
    [9] J. Saxe, K. Berlin, Deep neural network based Malware detection using two dimensional binary program features, 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), (2015), 11-20.
    [10] E. David, N. S. Netanyahu, DeepSign: Deep learning for automatic malware signature generation and classification, Internat. Joint Confer. Neural Networks, (2015), 1-8.
    [11] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, G. Giacinto, Novel feature extraction, selection and fusion for effective Malware family classification, 6th ACM Conference on Data and Applications Security and Privacy (CODASPY), 2016.
    [12] Z. A. Genç, G. Lenzini, P. Y. A. Ryan, Next Generation Cryptographic Ransomware, (2018), 385-401.
    [13] J. Sahs, L. Khan, A machine learning approach to Android Malware detection, Intelligence and Security Informatics Conference (EISIC), (2012), 141-147.
    [14] X. G. Han, W. Qu, X. X. Yao, C. Y. Guo, F. Zhou, Research on malicious code variants detection based on texture fingerprint, J. Commun., 2014.
    [15] K. S. Han, J. H. Lim, B. Kang, E. G. Im, Malware analysis using visualized images and entropy graphs, Int. J. Inf. Secur., 14 (2015), 1-14. doi: 10.1007/s10207-014-0242-0
    [16] J. Y. Kim, S. J. Bu, S. B. Cho, Zero-day malware detection using transferred generative adversarial networks based on deep autoencoders, Inf. Sci., (2018), 83-102.
    [17] S. Shang, N. Zheng, X. Jian, M. Xu, H. P. Zhang, Detecting malware variants via function-call graph similarity, International Conference on Malicious & Unwanted Software, (2010), 113-120.
    [18] B. Anderson, C. Storlie, T. Lane, Improving malware classification: Bridging the static/dynamic gap, ACM Workshop on Security & Artificial Intelligence, 2012.
    [19] P. Zhang, B. Sun, R. Ma, A. Li, A novel visualization Malware detection method based on Spp-Net, 2019 IEEE 5th International Conference on Computer and Communications (ICCC), (2019), 510-514.
    [20] G. Xiao, J. Li, Y. Chen, K. Li, MalFCS: An effective malware classification framework with automated feature extraction based on deep convolutional neural networks, J. Parall. Distribut. Comput., 141 (2020), 49-58. doi: 10.1016/j.jpdc.2020.03.012
    [21] J. E. Engelen, H. H. Hoos, A survey on semi-supervised learning, Mach. Learn., 109 (2020), 373-440. doi: 10.1007/s10994-019-05855-6
    [22] K. Nigam, A. Mccallum, T. Mitchell, Semi-supervised text classification Using EM, MIT Press, (2006), 33-55.
    [23] F. D. Frumosu, M. Kulahci, Outliers detection using an iterative strategy for semi-supervised learning, Quality Reliab. Eng., 35 (2019).
    [24] Z. H. Zhou, M. Li, Tri-training: Exploiting unlabeled data using three classifiers, IEEE Trans. Knowl. Data Eng., 17 (2005), 1529-1541. doi: 10.1109/TKDE.2005.186
    [25] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, Proceed Conference Computer Learning, (1998), 92-100.
    [26] Y. Zhou, S. Goldman, Democratic co-learning, Proc. 16th IEEE Int. Conf. Tools Artif. Intell., (2004), 594-602.
    [27] D. G. Kong, G. H. Yan, Discriminant malware distance learning on structural information for automated malware classification, Performance Evaluation Review, 41 (2013), 347-348. doi: 10.1145/2494232.2465531
    [28] D. Angluin, P. Laird, Learning from noisy examples, Mach. Learn., 4 (1988), 343-370, Available from: https://doi.org/10.1007/BF00116829
    [29] M. J. Sullivan, Distribution of edaphic diatoms in a mississippi salt marsh: A canonical correlation analysis, J. Phycol., (1982), 130-133.
    [30] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision., 60 (2004), 91-110. doi: 10.1023/B:VISI.0000029664.99615.94
    [31] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision & Pattern Recognition (CVPR), 2005.
    [32] T. Ojala, M. Pietikäinen, T. Mäenpää, Gray scale and rotation invariant texture classification with local binary patterns, European Conference on Computer Vision (ECCV), (2000), 404-420.
    [33] J. Ma, X. Jiang, A. Fan, J. Jiang, J. Yan, Image matching from Handcrafted to deep features: A survey, Int. J. Comput. Vision, 1 (2020), 1-57.
    [34] T. Ahonen, J. Matas, H. Chu, M. Pietikäinen, Rotation invariant image description with local binary pattern histogram fourier features, Image Analys., (2009), 61-70.
    [35] H. Ran, W. Qi, Z. Guo, Feature reduction of multi-scale LBP for texture classification, 2015 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (ⅡH-MSP), (2015), 397-400.
    [36] I. H. Witten, E. Frank, Data mining: Practical machine learning tools and techniques, Second Edition, ACM Sigmod. Record., 31 (2005), 76-77.
    [37] M. Haghighat, M. Abdel-Mottaleb, W. Alhalabi, Fully automatic face normalization and single sample face recognition in unconstrained environments, Expert Syst Appl., 47(2016), 23-34. doi: 10.1016/j.eswa.2015.10.047
    [38] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, Microsoft Malware Classification Challenge, CORR, 2018.
  • © 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
