An innovative parameter optimization of Spark Streaming based on D3QN with Gaussian process regression

Hong Zhang; Zhenchao Xu; Yunxiang Wang; Yupeng Shen

doi:10.3934/mbe.2023647

Mathematical Biosciences and Engineering

2023, Volume 20, Issue 8: 14464-14486. doi: 10.3934/mbe.2023647

Research article

1. School of Cyber Security and Computer, Hebei University, Baoding, China
2. Bureau of Geophysical Prospecting, Baoding, China

Academic Editor: Yang Kuang

Received: 05 March 2023 Revised: 07 June 2023 Accepted: 20 June 2023 Published: 03 July 2023

Spark Streaming, a computing framework based on Spark, is widely used to process streaming data such as social media data, IoT sensor data and web logs. As streaming data analysis has become widespread, performance optimization for Spark Streaming has grown into a popular research topic. Existing methods for improving Spark Streaming's performance include task scheduling, resource allocation and data skew optimization, and they rely primarily on manually tuning the parameter configuration. However, adjusting more than 200 parameters through repeated trial-and-error debugging is difficult and inefficient. In this paper, we propose an improved dueling double deep Q-network (D3QN) technique for parameter tuning, which can significantly improve the performance of Spark Streaming. The approach combines reinforcement learning with Gaussian process regression to reduce the number of iterations and accelerate convergence. The experimental results show that the dueling double DQN method with Gaussian process regression improves performance by up to 30.24%.
- Spark Streaming
- Gaussian process regression
- dueling double DQN
- parameter optimization
Citation: Hong Zhang, Zhenchao Xu, Yunxiang Wang, Yupeng Shen. An innovative parameter optimization of Spark Streaming based on D3QN with Gaussian process regression[J]. Mathematical Biosciences and Engineering, 2023, 20(8): 14464-14486. doi: 10.3934/mbe.2023647
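
The two reinforcement learning ingredients named in the abstract, the dueling architecture and double Q-learning, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy numbers and the reward are hypothetical, real networks are replaced by hand-written values, and the Gaussian process regression component is left out. The sketch only shows the standard mean-subtracted dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A) and the double DQN target, in which the online network selects the next action and the target network evaluates it.

```python
from typing import List, Sequence


def dueling_q_values(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a state value V(s) and per-action advantages A(s, a) into Q-values.

    Uses the mean-subtracted aggregation Q(s, a) = V(s) + A(s, a) - mean(A).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(
    reward: float,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    done: bool,
) -> float:
    """Compute the double DQN training target for one transition.

    The online network's Q-values pick the next action; the target
    network's Q-values supply the estimate for that action.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Toy example (all numbers are made up). Each action could stand for one
# candidate parameter adjustment, and the reward for a measured performance
# gain; both interpretations are assumptions, not taken from the paper.
q_online_next = dueling_q_values(1.0, [0.5, -0.5, 0.0])
q_target_next = dueling_q_values(2.0, [-1.0, 1.0, 0.0])
target = double_dqn_target(
    reward=1.0,
    gamma=0.5,
    q_online_next=q_online_next,
    q_target_next=q_target_next,
    done=False,
)
print(q_online_next, q_target_next, target)
```

Separating action selection from action evaluation is what distinguishes the double DQN target from the plain DQN target, which would take the maximum of the target network's own Q-values.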

© 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)