With the abundance of raw data generated from various sources, including social networks, big data technologies have become essential for acquiring, processing, and analyzing heterogeneous data from multiple sources for real-time applications. In this paper, we propose a big data framework suitable for pre-processing and classification of images as well as text analytics. The framework employs two key workflows: a big data (BD) pipeline and a machine learning (ML) pipeline. Our unique end-to-end workflow integrates data cleansing, data integration, data transformation and data reduction processes, followed by analytics using suitable machine learning techniques. Further, our model is the first of its kind to augment facial recognition with sentiment analysis in a distributed big data framework. The implementation of our model uses state-of-the-art distributed technologies to ingest, prepare, process and analyze big data, and generates actionable data insights by employing relevant ML algorithms such as k-NN, logistic regression and decision trees. In addition, we demonstrate the application of our big data framework to a facial recognition system by developing a prototype with open-source resources as a use case. We also apply sentiment analysis to non-repetitive, semi-structured public text data such as user comments, image tags, and other information associated with the facial images. We believe our work provides a novel approach that brings together big data, ML and face recognition, and that it could open new research directions to alleviate some of the challenges associated with big data processing in real-world applications.
Citation: Suriya Priya R Asaithambi, Sitalakshmi Venkatraman, Ramanathan Venkatraman. Proposed big data architecture for facial recognition using machine learning[J]. AIMS Electronics and Electrical Engineering, 2021, 5(1): 68-92. doi: 10.3934/electreng.2021005
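The two-stage workflow summarized in the abstract can be illustrated with a minimal, single-machine sketch. This is not the authors' implementation, which relies on distributed technologies. The toy records, the function names, the choice of min-max scaling for transformation, and the removal of constant columns for reduction are all assumptions made for illustration. The BD pipeline steps (integration, cleansing, transformation, reduction) prepare the data, and a plain k-NN majority vote stands in for the ML pipeline.

```python
from collections import Counter
from math import dist

# Hypothetical toy records from two sources: (feature vector, label).
# None marks a missing value; the third feature is constant on purpose.
SOURCE_A = [((1.0, 2.0, 5.0), "alice"), ((1.2, None, 5.0), "alice")]
SOURCE_B = [((8.0, 9.0, 5.0), "bob"), ((8.5, 9.5, 5.0), "bob"),
            ((1.1, 2.1, 5.0), "alice")]


def integrate(*sources):
    """Data integration: merge records from several sources into one list."""
    return [rec for src in sources for rec in src]


def cleanse(records):
    """Data cleansing: drop records that contain missing values."""
    return [(x, y) for x, y in records if None not in x]


def fit_transform_params(records):
    """Data transformation: learn per-column min-max scaling parameters."""
    cols = list(zip(*(x for x, _ in records)))
    lows = [min(c) for c in cols]
    # A constant column has zero range; use 1.0 to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return lows, spans


def fit_reduction(records):
    """Data reduction: keep only the columns whose values actually vary."""
    cols = list(zip(*(x for x, _ in records)))
    return [i for i, c in enumerate(cols) if max(c) > min(c)]


def prepare(x, lows, spans, keep):
    """Apply the learned scaling and column selection to one vector."""
    scaled = [(v - lo) / s for v, lo, s in zip(x, lows, spans)]
    return tuple(scaled[i] for i in keep)


def knn_predict(train, query, k=3):
    """ML pipeline: majority vote among the k nearest training vectors."""
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# BD pipeline: integrate, cleanse, then fit transformation and reduction.
records = cleanse(integrate(SOURCE_A, SOURCE_B))
lows, spans = fit_transform_params(records)
keep = fit_reduction(records)
train = [(prepare(x, lows, spans, keep), y) for x, y in records]

# ML pipeline: classify a new, hypothetical feature vector.
prediction = knn_predict(train, prepare((1.05, 2.05, 5.0), lows, spans, keep))
print(prediction)
```

The scaling parameters and the retained columns are learned once from the prepared training records and then reused for every query, so that training data and new inputs pass through the same pre-processing before classification.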