Big data integration enhancement based on attributes conditional dependency and similarity index method

Vishnu Vandana Kolisetty; Dharmendra Singh Rajput

doi:10.3934/mbe.2021429

Mathematical Biosciences and Engineering

2021, Volume 18, Issue 6: 8661-8682. doi: 10.3934/mbe.2021429


Research article

Vishnu Vandana Kolisetty ¹, Dharmendra Singh Rajput ²

1. SCOPE, Vellore Institute of Technology, Vellore 632014, India
2. SITE, Vellore Institute of Technology, Vellore 632014, India

Received: 01 July 2021 Accepted: 16 September 2021 Published: 11 October 2021

Big data has attracted considerable attention across many domains. The volume of digital data generated in every domain today is enormous, and the demand for acquiring such information for analysis and decision-making is growing in every field. It is therefore important to integrate related information based on similarity. Existing integration techniques, however, usually suffer from high processing and time complexity and are constrained in interconnecting multiple data sources. The information comes from a variety of sources, and because these sources are distributed in complex ways, it is difficult to determine the relationships between the data and to identify common data structures for integration, so that data can be accessed or retrieved effectively to meet the needs of different analyses. In this paper, we propose a big data integration method that combines the computation of attribute conditional dependency (ACD) with a similarity index (SI), termed ACD-SI. The ACD-SI mechanism uses an improved Bayesian mechanism to analyze the distribution of attributes in a document in terms of their dependence on possible attributes. It also uses attribute conversion and selection mechanisms to map and group data for integration, and it applies methods such as latent semantic analysis (LSA) to analyze the content of data attributes and extract relevant and accurate data. A series of experiments on a large dataset of bibliographic data from various publications measures the overall purity and normalized mutual information (NMI) of the data integration. The obtained purity and NMI values confirm the relevance of the clustered data, and the precision, recall, and accuracy rate show the improvement of the proposal over existing approaches.
Keywords: integration attributes dependency; similarity index; big data
Citation: Vishnu Vandana Kolisetty, Dharmendra Singh Rajput. Big data integration enhancement based on attributes conditional dependency and similarity index method[J]. Mathematical Biosciences and Engineering, 2021, 18(6): 8661-8682. doi: 10.3934/mbe.2021429
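
The abstract reports cluster quality in terms of purity and NMI. As a rough illustration of what these two measures compute, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `purity` and `nmi`, the toy labels, and the choice of normalizing mutual information by the arithmetic mean of the two entropies are all assumptions made for this example (other normalizations, such as the geometric mean or the maximum, are also in common use).

```python
from collections import Counter
from math import log


def purity(clusters, labels):
    """Fraction of items that belong to the majority true class of their cluster."""
    per_cluster = {}
    for c, y in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels)


def entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum((v / n) * log(v / n) for v in counts.values())


def nmi(clusters, labels):
    """Mutual information between two partitions, normalized by the
    arithmetic mean of their entropies (an assumed convention)."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum(
        (n_cy / n) * log(n_cy * n / (cluster_sizes[c] * class_sizes[y]))
        for (c, y), n_cy in joint.items()
    )
    denom = (entropy(cluster_sizes, n) + entropy(class_sizes, n)) / 2
    # Both partitions consist of a single group: treat them as identical.
    return 1.0 if denom == 0 else mi / denom


# Hypothetical toy data: six records, two clusters, two true classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
print("purity:", purity(toy_clusters, toy_labels))
print("NMI:", nmi(toy_clusters, toy_labels))
```

Purity rewards clusters dominated by a single class but grows trivially as the number of clusters increases, which is one reason it is usually reported together with NMI.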



© 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)