Spam is any form of annoying, unsolicited digital communication sent in bulk; it may carry offensive content, viruses, and cyber-attacks. The sharp increase in spam has necessitated more reliable and robust artificial intelligence-based anti-spam filters. Besides text, an email sometimes contains multimedia content such as audio, video, and images. However, text-centric email spam filtering based on text classification techniques remains today's preferred choice. In this paper, we show that text pre-processing techniques nullify the detection of malicious content in an obscure communication framework. We use the SpamAssassin corpus with and without text pre-processing and examine it using machine learning (ML) and deep learning (DL) algorithms to classify emails as ham or spam. The proposed DL-based approach consistently outperforms the ML models. In the first stage, with pre-processing, the long short-term memory (LSTM) model achieves the best results: 93.46% precision, 96.81% recall, and 95% F1-score. In the second stage, without pre-processing, LSTM again achieves the best results: 95.26% precision, 97.18% recall, and 96% F1-score. The results show the superiority of DL algorithms over standard ML algorithms in filtering spam. However, the results are unsatisfactory for detecting encrypted communication with both families of algorithms.
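As a quick consistency check on the reported figures, the F1-score can be recomputed from the stated precision and recall as their harmonic mean. The sketch below is illustrative only: the helper `f1_score` and the dictionary names are assumptions introduced here, not code from the paper, and only the four precision and recall values are taken from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for the LSTM model.
# The stage labels are hypothetical names chosen for this sketch.
reported = {
    "with pre-processing": (0.9346, 0.9681),
    "without pre-processing": (0.9526, 0.9718),
}

# Recompute F1 for each stage from the reported precision and recall.
f1_by_stage = {stage: f1_score(p, r) for stage, (p, r) in reported.items()}

for stage, f1 in f1_by_stage.items():
    print(f"{stage}: F1 = {100 * f1:.2f}%")
```

Rounded to whole percentages, the recomputed values agree with the 95% and 96% F1-scores stated above.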
Citation: Khan Farhan Rafat, Qin Xin, Abdul Rehman Javed, Zunera Jalil, Rana Zeeshan Ahmad. Evading obscure communication from spam emails[J]. Mathematical Biosciences and Engineering, 2022, 19(2): 1926-1943. doi: 10.3934/mbe.2022091